
Commit 766e002

Add Parquet optimizations as a fundable project
1 parent 9d2e611 commit 766e002

4 files changed

Lines changed: 67 additions & 2 deletions

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+#### Overview
+
+Apache Parquet is an open source, column-oriented data file format designed for
+efficient data storage and retrieval. Together with Apache Arrow for in-memory data,
+it has become the de facto standard for efficient columnar analytics.
+
+While Parquet and Arrow are most often used together, they have incompatible physical
+representations of data with optional values: data where some values can be
+missing or "null". Arrow uses a validity bitmap for each schema field and nesting level, whereas
+Parquet condenses that information into a more sophisticated structure called definition
+levels (borrowing ideas from Google's Dremel project).
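To make the difference concrete, here is a minimal sketch of the two representations for a flat optional column, where the maximum definition level is 1 (a level of 1 means the value is present, 0 means it is null). The helper `DefLevelsToValidityBitmap` is hypothetical and is not part of the Arrow or Parquet C++ API; it only illustrates the mapping, using Arrow's least-significant-bit-first bitmap layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an Arrow or Parquet C++ API.
// Converts already-decoded definition levels of a flat optional column
// into an Arrow-style validity bitmap. Value i maps to bit (i % 8) of
// byte (i / 8), least-significant bit first; a set bit means "not null".
std::vector<std::uint8_t> DefLevelsToValidityBitmap(
    const std::vector<std::int16_t>& def_levels,
    std::int16_t max_def_level = 1) {
  assert(max_def_level >= 1);  // an optional column has at least one level
  std::vector<std::uint8_t> bitmap((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    // For flat data, only the maximum definition level marks a present value.
    if (def_levels[i] == max_def_level) {
      bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}
```

For the ten levels `{1, 0, 1, 1, 0, 1, 1, 1, 0, 1}`, this yields the two bytes `0xED` and `0x02`. Note that the input here is one 16-bit integer per value, which is the kind of intermediate representation the proposal aims to avoid.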
+
+Converting between those two representations is non-trivial and often turns out
+to be a performance bottleneck when reading a Parquet file as in-memory Arrow data.
+Even columns that contain practically no nulls can still suffer from it if
+the data is declared nullable (optional) at the schema level.
+
+We propose to optimize the conversion of null values from Parquet in Arrow C++
+for flat (non-nested) data:
+
+1. decoding Parquet definition levels directly into an Arrow validity bitmap, rather than using an
+intermediate representation as 16-bit integers;
+
+2. avoiding decoding definition levels entirely when a data page's statistics show
+it cannot contain any nulls (or, conversely, when it cannot contain any non-null values).
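The two ideas in the list above can be sketched as follows. This is an illustration under simplifying assumptions, not the proposed implementation: every name (`LevelRun`, `SetBits`, `RunsToValidityBitmap`, `ClassifyPage`, `PageValidityBitmap`) is hypothetical, and definition levels are modelled as plain runs of repeated values, whereas the real Parquet encoding is an RLE/bit-packing hybrid. The null count is modelled as optional because a page may carry no statistics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// All names below are hypothetical; none is an Arrow or Parquet C++ API.

// Simplified stand-in for one run of repeated definition levels.
struct LevelRun {
  std::int16_t level;
  std::size_t count;
};

// Sets `count` consecutive bits starting at bit `start` (LSB first per byte).
void SetBits(std::vector<std::uint8_t>& bitmap, std::size_t start,
             std::size_t count) {
  assert(start + count <= bitmap.size() * 8);
  for (std::size_t i = start; i < start + count; ++i) {
    bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
  }
}

// Idea 1: build the bitmap from the runs directly, without materializing
// one 16-bit integer per value. Flat column: level 1 means "present".
std::vector<std::uint8_t> RunsToValidityBitmap(
    const std::vector<LevelRun>& runs, std::size_t num_values) {
  std::vector<std::uint8_t> bitmap((num_values + 7) / 8, 0);
  std::size_t pos = 0;
  for (const LevelRun& run : runs) {
    if (pos >= num_values) break;
    // Clamp so that malformed input cannot write past the bitmap.
    const std::size_t count = std::min(run.count, num_values - pos);
    if (run.level == 1) SetBits(bitmap, pos, count);
    pos += count;
  }
  return bitmap;
}

enum class PageNullity { kNoNulls, kAllNulls, kMixed };

// Idea 2: classify a page from its (optional) null count alone.
PageNullity ClassifyPage(std::size_t num_values,
                         std::optional<std::size_t> null_count) {
  if (!null_count.has_value()) return PageNullity::kMixed;  // must decode
  if (*null_count == 0) return PageNullity::kNoNulls;
  if (*null_count == num_values) return PageNullity::kAllNulls;
  return PageNullity::kMixed;
}

// Decodes the runs only when the null count does not settle the answer.
std::vector<std::uint8_t> PageValidityBitmap(
    const std::vector<LevelRun>& runs, std::size_t num_values,
    std::optional<std::size_t> null_count) {
  std::vector<std::uint8_t> bitmap((num_values + 7) / 8, 0);
  switch (ClassifyPage(num_values, null_count)) {
    case PageNullity::kNoNulls:
      SetBits(bitmap, 0, num_values);  // every value is valid
      return bitmap;
    case PageNullity::kAllNulls:
      return bitmap;  // every bit stays 0
    case PageNullity::kMixed:
      break;
  }
  return RunsToValidityBitmap(runs, num_values);
}
```

In the no-nulls case this sketch still fills a bitmap for simplicity; the Arrow format also allows an array with a null count of zero to omit the validity bitmap altogether.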
+
+This work can optionally be extended to schemas with moderate amounts
+of nesting.
+
+Depending on the characteristics of the Parquet data, this could make Parquet reading 2x
+faster, or even more in some cases. If you are unsure whether your workload could
+benefit, we can discuss this based on sample Parquet files you provide us.
+
+##### Are you interested in this project? Either entirely or partially, contact us for more information on how to help us fund it

src/components/fundable/projectsDetails.ts

Lines changed: 15 additions & 2 deletions
@@ -5,8 +5,9 @@ import EmscriptenForgePackageRequestsMD from "@site/src/components/fundable/desc
 import SVE2SupportInXsimdMD from "@site/src/components/fundable/descriptions/SVE2SupportInXsimd.md"
 import MatrixOperationsInXtensorMD from "@site/src/components/fundable/descriptions/MatrixOperationsInXtensor.md"
 import BinaryViewInArrowCppMD from "@site/src/components/fundable/descriptions/BinaryViewInArrowCpp.md"
-import Decimal32InArrowCppMD from"@site/src/components/fundable/descriptions/Decimal32InArrowCpp.md"
-import Float16InArrowCppMD from"@site/src/components/fundable/descriptions/Float16InArrowCpp.md"
+import Decimal32InArrowCppMD from "@site/src/components/fundable/descriptions/Decimal32InArrowCpp.md"
+import Float16InArrowCppMD from "@site/src/components/fundable/descriptions/Float16InArrowCpp.md"
+import ParquetNullOptimizationsMD from "@site/src/components/fundable/descriptions/ParquetNullOptimizations.md"
 
 export const fundableProjectsDetails = {
     jupyterEcosystem: [
@@ -125,6 +126,18 @@ export const fundableProjectsDetails = {
             currentNbOfFunders: 0,
             currentFundingPercentage: 0,
             repoLink: "https://github.com/apache/arrow"
+        },
+        {
+            category: "Apache Arrow and Parquet",
+            title: "Parquet C++ reader optimizations",
+            pageName: "ParquetNullOptimizations",
+            shortDescription: "Converting Parquet optional values to nullable Arrow data is often a performance bottleneck.",
+            description: ParquetNullOptimizationsMD,
+            price: "TBD",
+            maxNbOfFunders: 1,
+            currentNbOfFunders: 0,
+            currentFundingPercentage: 0,
+            repoLink: "https://github.com/apache/arrow"
         }
     ]

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+import GetAQuotePage from '@site/src/components/fundable/GetAQuotePage';
+
+export default function FundablePage() {
+  const { siteConfig } = useDocusaurusContext();
+  return (
+    <GetAQuotePage/>
+  );
+}
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+import LargeProjectCardPage from '@site/src/components/fundable/LargeProjectCardPage';
+
+export default function FundablePage() {
+  const { siteConfig } = useDocusaurusContext();
+  return (
+    <LargeProjectCardPage/>
+  );
+}
