Commit 5214c2b

Merge pull request #330 from pitrou/parquet-fundable-optimizations
Add Parquet optimizations as a fundable project
2 parents 4fe9dd6 + f15e76f commit 5214c2b

4 files changed

Lines changed: 70 additions & 2 deletions

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+#### Overview
+
+Apache Parquet is an open source, column-oriented data file format designed for
+efficient data storage and retrieval. Together with Apache Arrow for in-memory data,
+it has become the *de facto* standard for efficient columnar analytics.
+
+While Parquet and Arrow are most often used together, they have incompatible physical
+representations of data with optional values, that is, data where some values can be
+missing or "null". While Arrow uses a validity bitmap for each schema field and nesting level,
+Parquet condenses that information in a more sophisticated structure called definition
+levels (borrowing ideas from Google's Dremel project).
+
+Converting between those two representations is non-trivial and often turns out
+to be a performance bottleneck when reading a Parquet file as in-memory Arrow data.
+Even columns that contain practically no nulls can still suffer from it if
+the data is declared nullable (optional) at the schema level.
+
+We propose to optimize the conversion of null values from Parquet in Arrow C++
+for flat (non-nested) data:
+
+1. decoding Parquet definition levels directly into an Arrow validity bitmap, rather than using an
+   intermediate representation as 16-bit integers;
+
+2. avoiding decoding definition levels entirely when a data page's statistics show
+   it cannot contain any nulls (or, conversely, when it cannot contain any non-null values).
+
+As a subsequent task, these optimizations may be extended to apply to schemas
+with moderate amounts of nesting.
+
+This work will benefit applications using Arrow C++ or any of its language
+bindings (such as PyArrow, R-Arrow...).
+
+Depending on the characteristics of the Parquet data, this could make Parquet reading 2x
+faster, or even more in some cases. If you are unsure whether your workload could
+benefit, we can discuss this based on sample Parquet files you provide us.
+
+##### Interested in this project? Contact us for more information on how to help fund it, entirely or partially.
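
To make the two optimizations listed in the description above more concrete, here is a minimal TypeScript sketch (the proposal itself targets Arrow C++). It assumes a flat optional column, so definition levels are either 0 (null) or 1 (present), and a simplified model in which the levels arrive as already-parsed runs of repeated values. `LevelRun`, `Validity`, `runsToValidity`, `pageValidity`, and `statsNullCount` are hypothetical names chosen for illustration, not Arrow or Parquet APIs.

```typescript
// Hypothetical model of already-parsed definition-level runs for a flat
// optional column: each run repeats one level (0 = null, 1 = present).
interface LevelRun {
    level: number;
    count: number;
}

interface Validity {
    // null means no bitmap is needed because every value is valid.
    bitmap: Uint8Array | null;
    nullCount: number;
}

// Set `count` bits starting at bit `start`, least-significant bit first
// (the bit order Arrow uses for validity bitmaps).
function setBits(bitmap: Uint8Array, start: number, count: number): void {
    for (let i = start; i < start + count; i++) {
        bitmap[i >> 3] |= 1 << (i & 7);
    }
}

// Optimization 1 (sketch): write validity bits straight from the runs,
// without first materializing one 16-bit integer per value.
function runsToValidity(runs: LevelRun[], numValues: number): Validity {
    const bitmap = new Uint8Array(Math.ceil(numValues / 8)); // all bits 0 = null
    let offset = 0;
    let nullCount = 0;
    for (const run of runs) {
        if (run.level === 1) {
            setBits(bitmap, offset, run.count);
        } else {
            nullCount += run.count;
        }
        offset += run.count;
    }
    return { bitmap, nullCount };
}

// Optimization 2 (sketch): consult the page's null count, when available,
// and only decode the levels if the page mixes nulls and non-nulls.
function pageValidity(
    numValues: number,
    statsNullCount: number | undefined,
    decodeRuns: () => LevelRun[],
): Validity {
    if (statsNullCount === 0) {
        // No nulls at all: skip decoding, no bitmap required.
        return { bitmap: null, nullCount: 0 };
    }
    if (statsNullCount === numValues) {
        // Only nulls: an all-zero bitmap, again without decoding.
        return { bitmap: new Uint8Array(Math.ceil(numValues / 8)), nullCount: numValues };
    }
    return runsToValidity(decodeRuns(), numValues);
}
```

Passing the decoder as a callback keeps the skip path explicit: when the null count settles the answer, `decodeRuns` is never invoked. A real reader would also have to handle bit-packed runs and pages without usable statistics, which this sketch leaves out.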

src/components/fundable/projectsDetails.ts

Lines changed: 15 additions & 2 deletions
@@ -6,8 +6,9 @@ import SVE2SupportInXsimdMD from "@site/src/components/fundable/descriptions/SVE
 import Float16SupportInXsimdMD from "@site/src/components/fundable/descriptions/Float16SupportInXsimd.md"
 import MatrixOperationsInXtensorMD from "@site/src/components/fundable/descriptions/MatrixOperationsInXtensor.md"
 import BinaryViewInArrowCppMD from "@site/src/components/fundable/descriptions/BinaryViewInArrowCpp.md"
-import Decimal32InArrowCppMD from"@site/src/components/fundable/descriptions/Decimal32InArrowCpp.md"
-import Float16InArrowCppMD from"@site/src/components/fundable/descriptions/Float16InArrowCpp.md"
+import Decimal32InArrowCppMD from "@site/src/components/fundable/descriptions/Decimal32InArrowCpp.md"
+import Float16InArrowCppMD from "@site/src/components/fundable/descriptions/Float16InArrowCpp.md"
+import ParquetNullOptimizationsMD from "@site/src/components/fundable/descriptions/ParquetNullOptimizations.md"

 export const fundableProjectsDetails = {
     jupyterEcosystem: [
@@ -138,6 +139,18 @@ export const fundableProjectsDetails = {
             currentNbOfFunders: 0,
             currentFundingPercentage: 0,
             repoLink: "https://github.com/apache/arrow"
+        },
+        {
+            category: "Apache Arrow and Parquet",
+            title: "Parquet reader optimizations",
+            pageName: "ParquetNullOptimizations",
+            shortDescription: "Converting Parquet optional values to nullable Arrow data is often a performance bottleneck. We will optimize that step for the most common cases.",
+            description: ParquetNullOptimizationsMD,
+            price: "TBD",
+            maxNbOfFunders: 1,
+            currentNbOfFunders: 0,
+            currentFundingPercentage: 0,
+            repoLink: "https://github.com/apache/arrow"
         }
     ]

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+import GetAQuotePage from '@site/src/components/fundable/GetAQuotePage';
+
+export default function FundablePage() {
+    const { siteConfig } = useDocusaurusContext();
+    return (
+        <GetAQuotePage/>
+    );
+}
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
+import useDocusaurusContext from '@docusaurus/useDocusaurusContext';
+import LargeProjectCardPage from '@site/src/components/fundable/LargeProjectCardPage';
+
+export default function FundablePage() {
+    const { siteConfig } = useDocusaurusContext();
+    return (
+        <LargeProjectCardPage/>
+    );
+}
