PARQUET-3479: Add configuration to disable early dictionary compression check #3556

Open
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/parquet_3479

@yadavay-amzn
Contributor

@yadavay-amzn yadavay-amzn commented May 11, 2026

Problem

FallbackValuesWriter calls isCompressionSatisfying() after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns — dictionary encoding gets abandoned before enough data has accumulated to show its benefit, resulting in significantly larger files.

As reported in #3479, a column with 1M int64 values mod 32768 produces 8.4MB with the premature fallback vs 2.2MB when dictionary encoding is preserved.
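
The size gap can be reproduced with back-of-the-envelope arithmetic. The sketch below is only a model, not the parquet-java code: it assumes the values arrive in order as i % 32768, that dictionary ids are bit-packed at a fixed width, and that the check keeps dictionary encoding only when encoded bytes plus dictionary bytes are smaller than the raw bytes. The class and method names are made up for illustration.

```java
// Back-of-the-envelope model of the first-page dictionary check.
// Assumptions (not taken from the PR): values are i % 32768 in order,
// ids are bit-packed at a fixed width, and the check is
// "encoded id bytes + dictionary bytes < raw bytes".
final class DictionaryCheckArithmetic {
    static final int CARDINALITY = 32_768;
    static final int VALUE_BYTES = 8; // int64

    /** Distinct values seen after writing the first n values. */
    static long distinct(long n) {
        return Math.min(n, CARDINALITY);
    }

    /** Bits needed to address the given number of dictionary entries (at least 1). */
    static int bitWidth(long distinct) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(distinct - 1));
    }

    static long rawBytes(long n) {
        return n * VALUE_BYTES;
    }

    static long dictionaryBytes(long n) {
        return distinct(n) * VALUE_BYTES;
    }

    /** Size of n bit-packed dictionary ids, rounded up to whole bytes. */
    static long encodedIdBytes(long n) {
        return (n * bitWidth(distinct(n)) + 7) / 8;
    }

    /** Modelled check: is dictionary encoding smaller than plain encoding? */
    static boolean compressionSatisfying(long n) {
        return encodedIdBytes(n) + dictionaryBytes(n) < rawBytes(n);
    }

    public static void main(String[] args) {
        for (long n : new long[] {20_000, 1_000_000}) {
            System.out.printf("n=%d raw=%d dict+ids=%d keep=%b%n",
                n, rawBytes(n), encodedIdBytes(n) + dictionaryBytes(n),
                compressionSatisfying(n));
        }
    }
}
```

Under these assumptions the first 20k-row page sees 20,000 distinct values, so dictionary plus ids (197,500 bytes) exceeds the 160,000 raw bytes and the modelled check fails. Over the full 1M values the same encoding needs about 2.1MB against 8MB raw, which is in the same ballpark as the figures reported in the issue.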

Fix

Add a configurable property ParquetProperties.isDictionaryEarlyCheckEnabled() (default: true for backward compatibility) that controls whether the first-page compression check is performed in FallbackValuesWriter.getBytes().

When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (shouldFallBack()), not based on the first-page compression ratio.

Changes

  • ParquetProperties: added dictionaryEarlyCheckEnabled field, getter, and builder method
  • FallbackValuesWriter: added overloaded of() factory and constructor accepting the flag; guarded the isCompressionSatisfying call
  • DefaultValuesWriterFactory: passes the config through to FallbackValuesWriter.of()
  • New test TestFallbackValuesWriter: verifies dictionary encoding is preserved when the check is disabled

Testing

  • New unit tests pass (2/2)
  • Existing parquet-column tests unaffected (default true preserves existing behavior)

@yadavay-amzn
Contributor Author

@Fokko Could you take a look? This adds a config (parquet.dictionary.early.check.enabled) to disable the first-page compression check in FallbackValuesWriter. With modern page-index defaults (~20k rows/page), the check fires too early for moderate-cardinality columns, abandoning dictionary encoding prematurely. Includes unit test + E2E integration test writing real Parquet files. Thanks!

@wgtmac
Member

wgtmac commented Jun 3, 2026

Thanks for looking into this!

While the problem is real, introducing a boolean flag dictionaryEarlyCheckEnabled feels like a band-aid fix that pushes the burden to users. Most users won't know when to manually toggle this to prevent storage inflation.

Instead of a new config, could we make the heuristic more adaptive? For example, we could delay the compression check until we've accumulated a certain amount of raw data (e.g., 1MB), or evaluate it over the first N pages rather than just the first one.

This would solve the issue out of the box without hurting usability. Thoughts?

@yadavay-amzn
Contributor Author

Thanks for taking a look @wgtmac !

I like the idea of having a threshold on the raw data to delay the compression check. Updated the check in FallbackValuesWriter.getBytes() to only fire if rawDataByteSize >= threshold (defaulting to 1MB).

I think the page-count approach (checking after N pages) is sensitive to the page size configuration: with 1KB pages you'd need many pages to accumulate meaningful data, while with 1MB pages a single page might be enough. A byte threshold adapts naturally regardless of page size settings.

That being said, I'd like to know which approach you'd prefer and iterate based on that.

@wgtmac
Member

wgtmac commented Jun 4, 2026

Thanks for the quick update. Before diving into the code detail, I have some additional questions:

  • parquet.dictionary.check.after.bytes is confusing. Does "bytes" refer to raw (uncompressed) bytes or compressed/encoded bytes? Please rename it to be explicit (e.g., parquet.dictionary.check.after.raw.bytes) and update the documentation accordingly.
  • In FallbackValuesWriter, rawDataByteSize is reset to 0 at the end of every page via reset(). With modern configurations (e.g., 20k rows per page), a single page's size will almost never reach the 1MB threshold. This effectively disables the compression check entirely. Did you intend for this data size to accumulate across pages, or is permanently bypassing the check for small pages your actual design?
  • Since nulls are encoded in definition levels and do not contribute to rawDataByteSize, a page with a high null ratio will have an extremely small raw data size. How does this algorithm account for heavily null-populated columns? Will it also indefinitely bypass the check?
  • It would be good to keep the default behavior unchanged: many downstream projects (e.g. Apache Spark) unfortunately check the metadata of produced files (including encoding, file size, etc.) in their integration tests, so a behavior change may break them.

cc @gszadovszky @Fokko for additional advice.

@yadavay-amzn
Contributor Author

yadavay-amzn commented Jun 4, 2026

@wgtmac thanks for spotting that. Updated:

  1. Renamed config to parquet.dictionary.check.after.raw.bytes to clarify it refers to uncompressed bytes.

  2. Fixed the reset issue. You were right that rawDataByteSize is reset per page via reset(), so it would never reach a 1MB threshold. Added a separate cumulativeRawBytes counter that accumulates across pages and only resets in resetDictionary() (between column chunks). The threshold gate uses cumulativeRawBytes; the actual compression comparison still uses the current page's rawDataByteSize vs. encoded size, the same comparison as before.

  3. Default is now 0 (backward compatible, check fires on first page, same as old firstPage behavior). Users can opt in to a higher threshold to delay the check.

  4. Null-heavy columns: with the default of 0 the check fires on the first page regardless of null ratio, same as before. With a higher threshold, nulls don't contribute to cumulativeRawBytes (they're in definition levels), so the threshold takes longer to reach, but the check still fires once enough non-null values accumulate. For all-null columns with the default of 0, the check fires immediately since cumulativeRawBytes >= 0 is trivially true.
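
The counter behavior described in these points can be modelled in a few lines. This is a toy sketch rather than the actual FallbackValuesWriter change: it assumes int64 values (8 raw bytes each), 20k-row pages, a threshold of 1 MiB, and a gate evaluated once at the end of each page. The class name, `pagesUntilCheck`, and its parameters are hypothetical.

```java
// Toy model of the threshold gate; this is not the real FallbackValuesWriter.
// Assumptions: int64 values (8 raw bytes each), nulls add no raw bytes,
// and the gate is evaluated once at the end of each page.
final class ThresholdGateSketch {
    static final int INT64_BYTES = 8;

    /**
     * Returns the 1-based page on which the gate first opens, or -1 if it
     * stays closed for maxPages pages.
     *
     * @param cumulative true to gate on bytes summed across pages,
     *                   false to gate on the per-page counter only
     */
    static int pagesUntilCheck(long threshold, int nonNullValuesPerPage,
                               int maxPages, boolean cumulative) {
        long rawDataByteSize = 0;    // cleared at every page boundary
        long cumulativeRawBytes = 0; // kept for the whole column chunk
        for (int page = 1; page <= maxPages; page++) {
            long added = (long) INT64_BYTES * nonNullValuesPerPage;
            rawDataByteSize += added;
            cumulativeRawBytes += added;
            long gate = cumulative ? cumulativeRawBytes : rawDataByteSize;
            if (gate >= threshold) {
                return page;
            }
            rawDataByteSize = 0; // models reset() between pages
        }
        return -1;
    }

    public static void main(String[] args) {
        long oneMiB = 1L << 20;
        // Per-page counter: a 160,000-byte page never reaches 1 MiB.
        System.out.println(pagesUntilCheck(oneMiB, 20_000, 1_000, false));
        // Cumulative counter: the gate opens after a handful of pages.
        System.out.println(pagesUntilCheck(oneMiB, 20_000, 1_000, true));
        // 90% nulls: only 2,000 non-null values per page, so it takes longer.
        System.out.println(pagesUntilCheck(oneMiB, 2_000, 1_000, true));
        // All nulls: opens at once with threshold 0, never with 1 MiB.
        System.out.println(pagesUntilCheck(0, 0, 1_000, true));
        System.out.println(pagesUntilCheck(oneMiB, 0, 1_000, true));
    }
}
```

In this model the per-page counter never opens the gate, the cumulative counter opens it on page 7, and a column that is 90% null opens it on page 66. An all-null column opens the gate on the first page only when the threshold is 0; with a positive threshold it stays closed in the model.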

Let me know if this direction works or if you'd prefer a different approach. Thanks again for the prompt reviews!

public BytesInput getBytes() {
if (!fellBackAlready && firstPage) {
// we use the first page to decide if we're going to use this encoding
if (!fellBackAlready && !compressionChecked && cumulativeRawBytes >= checkAfterBytes) {
Member

Could we add a test where the threshold is crossed only after a reset()? This is the exact case we want to protect here, since rawDataByteSize is reset per page while cumulativeRawBytes should keep accumulating across pages. The current tests with 0 and Long.MAX_VALUE would not catch a regression back to per-page counting.

/**
* Set the raw data byte threshold after which the dictionary compression check is performed.
*
* @param val byte threshold (0 means check on every page)
Member

Nit: this says 0 means check on every page, but the implementation only checks once per column chunk, same as the old first-page behavior. Could we update this to say 0 means check on the first page / preserve the previous behavior?

public static final String BLOCK_ROW_COUNT_LIMIT = "parquet.block.row.count.limit";
public static final String PAGE_ROW_COUNT_LIMIT = "parquet.page.row.count.limit";
public static final String PAGE_WRITE_CHECKSUM_ENABLED = "parquet.page.write-checksum.enabled";
public static final String DICTIONARY_CHECK_AFTER_BYTES = "parquet.dictionary.check.after.raw.bytes";
Member

Could we also document parquet.dictionary.check.after.raw.bytes in the configuration list above? It would be useful to mention that this is based on raw value bytes, and nulls encoded in definition levels do not contribute to this threshold.

* @return this builder for method chaining
*/
public Builder withDictionaryCheckAfterBytes(long val) {
this.dictionaryCheckAfterBytes = val;
Member

Should we reject negative values here? A negative threshold effectively behaves like 0, but accepting it silently seems a bit confusing for a size-like config. Most nearby size/count options validate the input, so val >= 0 would be clearer.
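
A minimal sketch of the suggested validation, written as a standalone stand-in for the real builder. It uses a plain IllegalArgumentException so that it compiles on its own; the class name, the getter, and the error message are assumptions, and the actual builder may well use the project's own precondition helper instead.

```java
// Standalone stand-in for the builder method under review; only the
// method name withDictionaryCheckAfterBytes comes from the diff.
final class ThresholdBuilderSketch {
    // 0 keeps the previous behavior: check on the first page.
    private long dictionaryCheckAfterBytes = 0;

    /**
     * Sets the raw-byte threshold, rejecting negative values instead of
     * silently treating them like 0.
     */
    ThresholdBuilderSketch withDictionaryCheckAfterBytes(long val) {
        if (val < 0) {
            throw new IllegalArgumentException(
                "Invalid dictionary check threshold (negative): " + val);
        }
        this.dictionaryCheckAfterBytes = val;
        return this;
    }

    long dictionaryCheckAfterBytes() {
        return dictionaryCheckAfterBytes;
    }

    public static void main(String[] args) {
        ThresholdBuilderSketch ok =
            new ThresholdBuilderSketch().withDictionaryCheckAfterBytes(1L << 20);
        System.out.println("accepted: " + ok.dictionaryCheckAfterBytes());
        try {
            new ThresholdBuilderSketch().withDictionaryCheckAfterBytes(-1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```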

@Override
public void writeByte(int value) {
rawDataByteSize += 1;
cumulativeRawBytes += 1;
Member

Should we use cumulativeRawBytes + rawDataByteSize for checking and only increase cumulativeRawBytes once a page to save some cycles?
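
The two bookkeeping strategies can be compared side by side. The sketch below is illustrative only and assumes the gate is evaluated once per page; the class and method names are hypothetical. It shows that folding rawDataByteSize into cumulativeRawBytes at the page boundary opens the gate on the same page as bumping both counters on every value.

```java
// Compares per-value and per-page accumulation of the cumulative counter.
// Illustrative model only; names other than the two counters are made up.
final class PerPageAccumulationSketch {

    /** Per-value bookkeeping: cumulativeRawBytes is bumped on every byte written. */
    static int firstCheckedPagePerValue(int[] bytesPerPage, long threshold) {
        long cumulativeRawBytes = 0;
        for (int page = 0; page < bytesPerPage.length; page++) {
            for (int b = 0; b < bytesPerPage[page]; b++) {
                cumulativeRawBytes += 1; // extra add on the hot write path
            }
            if (cumulativeRawBytes >= threshold) {
                return page;
            }
        }
        return -1;
    }

    /** Per-page bookkeeping: only rawDataByteSize moves per value. */
    static int firstCheckedPagePerPage(int[] bytesPerPage, long threshold) {
        long cumulativeRawBytes = 0; // bytes of pages already finished
        for (int page = 0; page < bytesPerPage.length; page++) {
            long rawDataByteSize = 0;
            for (int b = 0; b < bytesPerPage[page]; b++) {
                rawDataByteSize += 1;
            }
            if (cumulativeRawBytes + rawDataByteSize >= threshold) {
                return page;
            }
            cumulativeRawBytes += rawDataByteSize; // fold in once per page
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] pages = {400, 400, 400};
        for (long threshold : new long[] {0, 1_000, 5_000}) {
            System.out.println(firstCheckedPagePerValue(pages, threshold)
                + " " + firstCheckedPagePerPage(pages, threshold));
        }
    }
}
```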

public static final String BLOCK_ROW_COUNT_LIMIT = "parquet.block.row.count.limit";
public static final String PAGE_ROW_COUNT_LIMIT = "parquet.page.row.count.limit";
public static final String PAGE_WRITE_CHECKSUM_ENABLED = "parquet.page.write-checksum.enabled";
public static final String DICTIONARY_CHECK_AFTER_BYTES = "parquet.dictionary.check.after.raw.bytes";
Member

TBH, the flag name still looks a little unclear to me. How about renaming it to parquet.dictionary.check.threshold.raw.size.bytes and also changing the associated variable and function names?

@gszadovszky
Contributor

I don't think integration tests that validate the metadata of result files should be a concern. Parquet-java should have the freedom to change its default behavior related to encoding, page limits, etc., as long as it follows the specification and the configured properties. These are not breaking changes.

I think we should make our changes as automatic and configuration-free as possible. Introducing a new configuration while keeping the current default behavior would lead to nobody actually using it. I'm OK with a default value that leaves the default behavior unchanged until we can benchmark the system in prod and come up with better defaults.

WDYT?

@yadavay-amzn
Contributor Author

yadavay-amzn commented Jun 5, 2026

@wgtmac Addressed all comments:

  • Applied code suggestions (named arg comments, field grouping, Javadoc on cumulativeRawBytes)
  • Renamed config to parquet.dictionary.check.threshold.raw.size.bytes
  • Performance: accumulate cumulativeRawBytes once per page (in getBytes()) instead of per value
  • Added multi-page test proving threshold is crossed after reset()
  • Javadoc fixed ("0 = check on first page")
  • Negative value validation added

Thanks for the thorough review!

3 participants