PARQUET-3479: Add configuration to disable early dictionary compression check#3556
PARQUET-3479: Add configuration to disable early dictionary compression check#3556yadavay-amzn wants to merge 1 commit into
Conversation
ddf1332 to
2609cdc
Compare
|
@Fokko Could you take a look? This adds a config ( |
2609cdc to
287f352
Compare
|
Thanks for looking into this! While the problem is real, introducing a boolean flag Instead of a new config, could we make the heuristic more adaptive? For example, we could delay the compression check until we've accumulated a certain amount of raw data (e.g., This would solve the issue out of the box without hurting usability. Thoughts? |
287f352 to
c7cf83e
Compare
|
Thanks for taking a look @wgtmac ! I like the idea of having a threshold on the raw data to delay the compression check. Updated the check in I think the page count approach (checking after N pages) is sensitive to page size configurations. So with 1KB pages you'd need many pages to accumulate meaningful data, while with 1MB pages a single page might be enough. A byte threshold adapts naturally regardless of page size settings. That being said I'd like to know which approach you'd prefer and iterate based on it. |
|
Thanks for the quick update. Before diving into the code detail, I have some additional questions:
cc @gszadovszky @Fokko for additional advice. |
c7cf83e to
3fcc6c8
Compare
|
@wgtmac thanks for spotting that. Updated:
Let me know if this direction works or if you'd prefer a different approach. Thanks again for the prompt reviews! |
| public BytesInput getBytes() { | ||
| if (!fellBackAlready && firstPage) { | ||
| // we use the first page to decide if we're going to use this encoding | ||
| if (!fellBackAlready && !compressionChecked && cumulativeRawBytes >= checkAfterBytes) { |
There was a problem hiding this comment.
Could we add a test where the threshold is crossed only after a reset()? This is the exact case we want to protect here, since rawDataByteSize is reset per page while cumulativeRawBytes should keep accumulating across pages. The current tests with 0 and Long.MAX_VALUE would not catch a regression back to per-page counting.
| /** | ||
| * Set the raw data byte threshold after which the dictionary compression check is performed. | ||
| * | ||
| * @param val byte threshold (0 means check on every page) |
There was a problem hiding this comment.
Nit: this says 0 means check on every page, but the implementation only checks once per column chunk, same as the old first-page behavior. Could we update this to say 0 means check on the first page / preserve the previous behavior?
| public static final String BLOCK_ROW_COUNT_LIMIT = "parquet.block.row.count.limit"; | ||
| public static final String PAGE_ROW_COUNT_LIMIT = "parquet.page.row.count.limit"; | ||
| public static final String PAGE_WRITE_CHECKSUM_ENABLED = "parquet.page.write-checksum.enabled"; | ||
| public static final String DICTIONARY_CHECK_AFTER_BYTES = "parquet.dictionary.check.after.raw.bytes"; |
There was a problem hiding this comment.
Could we also document parquet.dictionary.check.after.raw.bytes in the configuration list above? It would be useful to mention that this is based on raw value bytes, and nulls encoded in definition levels do not contribute to this threshold.
| * @return this builder for method chaining | ||
| */ | ||
| public Builder withDictionaryCheckAfterBytes(long val) { | ||
| this.dictionaryCheckAfterBytes = val; |
There was a problem hiding this comment.
Should we reject negative values here? A negative threshold effectively behaves like 0, but accepting it silently seems a bit confusing for a size-like config. Most nearby size/count options validate the input, so val >= 0 would be clearer.
| @Override | ||
| public void writeByte(int value) { | ||
| rawDataByteSize += 1; | ||
| cumulativeRawBytes += 1; |
There was a problem hiding this comment.
Should we use cumulativeRawBytes + rawDataByteSize for checking and only increase cumulativeRawBytes once a page to save some cycles?
| public static final String BLOCK_ROW_COUNT_LIMIT = "parquet.block.row.count.limit"; | ||
| public static final String PAGE_ROW_COUNT_LIMIT = "parquet.page.row.count.limit"; | ||
| public static final String PAGE_WRITE_CHECKSUM_ENABLED = "parquet.page.write-checksum.enabled"; | ||
| public static final String DICTIONARY_CHECK_AFTER_BYTES = "parquet.dictionary.check.after.raw.bytes"; |
There was a problem hiding this comment.
TBH, the flag name still looks a little bit unclear to me. How about renaming it to parquet.dictionary.check.threshold.raw.size.bytes and also change associated variable and function names?
|
I don't think that integration tests validating the result files metadata should be a concern. Parquet-java should have the freedom to change it's default behavior related to encoding/page limits etc. until according to the specification and the configured properties. These are not breaking changes. I think we should make our changes as automatic and configuration free as possible. Introducing new configuration with staying with the current default behavior would lead to nobody actually using it. I'm OK with having the default value so the default behavior is unchanged until we want to benchmark the system in prod so we can come up with better defaults. WDYT? |
5aa608f to
fefdf41
Compare
fefdf41 to
2be0d8b
Compare
|
@wgtmac Addressed all comments:
Thanks for the thorough review! |
Problem
FallbackValuesWritercallsisCompressionSatisfying()after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns — dictionary encoding gets abandoned before enough data has accumulated to show its benefit, resulting in significantly larger files.As reported in #3479, a column with 1M int64 values mod 32768 produces 8.4MB with the premature fallback vs 2.2MB when dictionary encoding is preserved.
Fix
Add a configurable property
ParquetProperties.isDictionaryEarlyCheckEnabled()(default:truefor backward compatibility) that controls whether the first-page compression check is performed inFallbackValuesWriter.getBytes().When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (
shouldFallBack()), not based on the first-page compression ratio.Changes
ParquetProperties: addeddictionaryEarlyCheckEnabledfield, getter, and builder methodFallbackValuesWriter: added overloadedof()factory and constructor accepting the flag; guarded theisCompressionSatisfyingcallDefaultValuesWriterFactory: passes the config through toFallbackValuesWriter.of()TestFallbackValuesWriter: verifies dictionary encoding is preserved when the check is disabledTesting
parquet-columntests unaffected (defaulttruepreserves existing behavior)