PARQUET-3479: Add configuration to disable early dictionary compression check by yadavay-amzn · Pull Request #3556 · apache/parquet-java

yadavay-amzn · 2026-05-11T22:17:22Z

Problem

FallbackValuesWriter calls isCompressionSatisfying() after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns — dictionary encoding gets abandoned before enough data has accumulated to show its benefit, resulting in significantly larger files.

As reported in #3479, a column with 1M int64 values mod 32768 produces 8.4MB with the premature fallback vs 2.2MB when dictionary encoding is preserved.
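A rough back-of-the-envelope sketch of why the first-page check gives up in this case. It assumes the column holds i % 32768 for consecutive i, a first page of 20,000 rows, bit-packed dictionary indices with no RLE or page-header overhead, and a rule that keeps the dictionary only when encoded bytes plus dictionary bytes are smaller than the raw bytes. The class and method names are illustrative, not parquet-java APIs.

```java
public class EarlyCheckArithmetic {
    // Bits needed to index a dictionary with the given number of entries.
    static int bitWidth(long distinct) {
        return distinct <= 1 ? 1 : 64 - Long.numberOfLeadingZeros(distinct - 1);
    }

    // Approximate bytes for bit-packed dictionary indices (no RLE or header overhead).
    static long encodedBytes(long rows, long distinct) {
        return (rows * bitWidth(distinct) + 7) / 8;
    }

    // Assumed rule: keep the dictionary only if indices + dictionary beat the raw size.
    static boolean keepsDictionary(long rows, long distinct, int valueBytes) {
        long raw = rows * valueBytes;
        long dictionary = distinct * valueBytes;
        return encodedBytes(rows, distinct) + dictionary < raw;
    }

    public static void main(String[] args) {
        // First page: 20,000 rows, all distinct because 20,000 < 32,768.
        System.out.println("first page keeps dictionary: " + keepsDictionary(20_000, 20_000, 8));
        // Whole column: 1,000,000 rows over 32,768 distinct values.
        System.out.println("whole column keeps dictionary: " + keepsDictionary(1_000_000, 32_768, 8));
        long total = encodedBytes(1_000_000, 32_768) + 32_768L * 8;
        System.out.println("estimated dictionary-encoded bytes: " + total);
    }
}
```

Under these assumptions the first page fails the check (the dictionary alone is as large as the raw page), while the whole column would encode to roughly 2.1 MB, in the same range as the 2.2MB reported in #3479.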

Fix

Add a configurable property ParquetProperties.isDictionaryEarlyCheckEnabled() (default: true for backward compatibility) that controls whether the first-page compression check is performed in FallbackValuesWriter.getBytes().

When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (shouldFallBack()), not based on the first-page compression ratio.

Changes

ParquetProperties: added dictionaryEarlyCheckEnabled field, getter, and builder method
FallbackValuesWriter: added overloaded of() factory and constructor accepting the flag; guarded the isCompressionSatisfying call
DefaultValuesWriterFactory: passes the config through to FallbackValuesWriter.of()
New test TestFallbackValuesWriter: verifies dictionary encoding is preserved when the check is disabled

Testing

New unit tests pass (2/2)
Existing parquet-column tests unaffected (default true preserves existing behavior)

yadavay-amzn · 2026-06-03T01:38:49Z

@Fokko Could you take a look? This adds a config (parquet.dictionary.early.check.enabled) to disable the first-page compression check in FallbackValuesWriter. With modern page-index defaults (~20k rows/page), the check fires too early for moderate-cardinality columns, abandoning dictionary encoding prematurely. Includes unit test + E2E integration test writing real Parquet files. Thanks!

wgtmac · 2026-06-03T05:53:52Z

Thanks for looking into this!

While the problem is real, introducing a boolean flag dictionaryEarlyCheckEnabled feels like a band-aid fix that pushes the burden to users. Most users won't know when to manually toggle this to prevent storage inflation.

Instead of a new config, could we make the heuristic more adaptive? For example, we could delay the compression check until we've accumulated a certain amount of raw data (e.g., 1MB), or evaluate it over the first N pages rather than just the first one.

This would solve the issue out of the box without hurting usability. Thoughts?

yadavay-amzn · 2026-06-04T05:03:33Z

Thanks for taking a look @wgtmac !

I like the idea of having a threshold on the raw data to delay the compression check. Updated the check in FallbackValuesWriter.getBytes() to only fire if rawDataByteSize >= threshold (defaulting to 1MB).

I think the page count approach (checking after N pages) is sensitive to page size configurations: with 1KB pages you'd need many pages to accumulate meaningful data, while with 1MB pages a single page might be enough. A byte threshold adapts naturally regardless of page size settings.
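To make the page-size sensitivity concrete, here is a small sketch that counts how many full pages are needed before a raw-byte threshold is reached. The page sizes and the 1MB threshold are the ones mentioned in this comment; the helper name is made up for illustration.

```java
public class PagesUntilThreshold {
    // Number of full pages of pageBytes needed to accumulate at least thresholdBytes.
    static long pagesNeeded(long thresholdBytes, long pageBytes) {
        return (thresholdBytes + pageBytes - 1) / pageBytes; // ceiling division
    }

    public static void main(String[] args) {
        long threshold = 1024L * 1024; // 1MB of raw data
        System.out.println("1KB pages: " + pagesNeeded(threshold, 1024));
        System.out.println("1MB pages: " + pagesNeeded(threshold, 1024L * 1024));
    }
}
```

A fixed "check after N pages" rule would need a different N for each of these configurations, while the byte threshold stays the same.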

That being said I'd like to know which approach you'd prefer and iterate based on it.

wgtmac · 2026-06-04T05:55:28Z

Thanks for the quick update. Before diving into the code detail, I have some additional questions:

parquet.dictionary.check.after.bytes is confusing. Does "bytes" refer to raw (uncompressed) bytes or compressed/encoded bytes? Please rename it to be explicit (e.g., parquet.dictionary.check.after.raw.bytes) and update the documentation accordingly.
In FallbackValuesWriter, rawDataByteSize is reset to 0 at the end of every page via reset(). With modern configurations (e.g., 20k rows per page), a single page's size will almost never reach the 1MB threshold. This effectively disables the compression check entirely. Did you intend for this data size to accumulate across pages, or is permanently bypassing the check for small pages your actual design?
Since nulls are encoded in definition levels and do not contribute to rawDataByteSize, a page with a high null ratio will have an extremely small raw data size. How does this algorithm account for heavily null-populated columns? Will it also indefinitely bypass the check?
It would be good to make the default behavior unchanged since many downstream environments (e.g. Apache Spark) unfortunately check produced file metadata (including encoding, file size, etc.) in their integration tests so behavior change may break them.

cc @gszadovszky @Fokko for additional advice.

yadavay-amzn · 2026-06-04T21:36:20Z

@wgtmac thanks for spotting that. Updated:

Renamed config to parquet.dictionary.check.after.raw.bytes to clarify it refers to uncompressed bytes.
Fixed the reset issue. You were right that rawDataByteSize resets per page via reset(), so it would never reach a 1MB threshold. Added a separate cumulativeRawBytes counter that accumulates across pages and only resets in resetDictionary() (between column chunks). The threshold gate uses cumulativeRawBytes; the actual compression comparison still uses the current page's rawDataByteSize vs. encoded size, the same comparison as before.
Default is now 0 (backward compatible, check fires on first page, same as old firstPage behavior). Users can opt in to a higher threshold to delay the check.
Null-heavy columns: with default 0 the check fires on the first page regardless of null ratio, same as before. With a higher threshold, nulls don't contribute to cumulativeRawBytes (they're in definition levels), so the threshold takes longer to reach, but the check still fires once enough non-null values accumulate. For all-null columns with the default of 0, the check fires immediately since cumulativeRawBytes >= 0 is trivially true.
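A minimal sketch of the gate described in this list, not the actual FallbackValuesWriter code. It assumes a per-page counter that reset() clears, a cumulative counter that only resetDictionary() clears, and a check that runs at most once per column chunk. The class is a stand-in; only the counter and reset method names mirror the ones discussed here.

```java
public class ThresholdGateSketch {
    private final long thresholdRawBytes;
    private long rawDataByteSize;     // raw bytes in the current page
    private long cumulativeRawBytes;  // raw bytes since the last resetDictionary()
    private boolean compressionChecked;

    ThresholdGateSketch(long thresholdRawBytes) {
        this.thresholdRawBytes = thresholdRawBytes;
    }

    // Record one non-null value of the given raw size; nulls never reach this method.
    void write(int valueBytes) {
        rawDataByteSize += valueBytes;
        cumulativeRawBytes += valueBytes;
    }

    // Called when a page is flushed: true if the compression check should run now.
    boolean shouldCheckOnPageFlush() {
        if (compressionChecked || cumulativeRawBytes < thresholdRawBytes) {
            return false;
        }
        compressionChecked = true; // at most one check per column chunk
        return true;
    }

    // End of page: only the per-page counter is cleared.
    void reset() {
        rawDataByteSize = 0;
    }

    // End of column chunk: everything starts over.
    void resetDictionary() {
        rawDataByteSize = 0;
        cumulativeRawBytes = 0;
        compressionChecked = false;
    }

    // Writes pages of rowsPerPage int64 values and returns the 1-based page on which
    // the check fires, or -1 if it never fires within maxPages.
    static int pageWhereCheckFires(long threshold, int rowsPerPage, int maxPages) {
        ThresholdGateSketch gate = new ThresholdGateSketch(threshold);
        for (int page = 1; page <= maxPages; page++) {
            for (int i = 0; i < rowsPerPage; i++) {
                gate.write(8);
            }
            if (gate.shouldCheckOnPageFlush()) {
                return page;
            }
            gate.reset();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(pageWhereCheckFires(0, 20_000, 10));            // threshold 0
        System.out.println(pageWhereCheckFires(1024L * 1024, 20_000, 10)); // 1MB threshold
    }
}
```

With 20,000 int64 values per page (160,000 raw bytes), a threshold of 0 fires on page 1 and a 1MB threshold fires on page 7. A per-page counter alone would never reach 1MB, and in this sketch a page stream with no non-null values never reaches a positive threshold at all.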

Let me know if this direction works or if you'd prefer a different approach. Thanks again for the prompt reviews!

wgtmac · 2026-06-05T05:48:29Z

+    /**
+     * Set the raw data byte threshold after which the dictionary compression check is performed.
+     *
+     * @param val byte threshold (0 means check on every page)


Nit: this says 0 means check on every page, but the implementation only checks once per column chunk, same as the old first-page behavior. Could we update this to say 0 means check on the first page / preserve the previous behavior?

wgtmac · 2026-06-05T05:48:41Z

  public static final String BLOCK_ROW_COUNT_LIMIT = "parquet.block.row.count.limit";
  public static final String PAGE_ROW_COUNT_LIMIT = "parquet.page.row.count.limit";
  public static final String PAGE_WRITE_CHECKSUM_ENABLED = "parquet.page.write-checksum.enabled";
+  public static final String DICTIONARY_CHECK_AFTER_BYTES = "parquet.dictionary.check.after.raw.bytes";


Could we also document parquet.dictionary.check.after.raw.bytes in the configuration list above? It would be useful to mention that this is based on raw value bytes, and nulls encoded in definition levels do not contribute to this threshold.

wgtmac · 2026-06-05T05:48:50Z

+     * @return this builder for method chaining
+     */
+    public Builder withDictionaryCheckAfterBytes(long val) {
+      this.dictionaryCheckAfterBytes = val;


Should we reject negative values here? A negative threshold effectively behaves like 0, but accepting it silently seems a bit confusing for a size-like config. Most nearby size/count options validate the input, so val >= 0 would be clearer.
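A sketch of the validation being asked for, assuming a plain IllegalArgumentException. The builder below is a stand-alone stand-in, not ParquetProperties.Builder, and the exact exception type and message used in the PR may differ.

```java
public class ThresholdBuilderSketch {
    private long dictionaryCheckAfterBytes = 0; // 0 keeps the first-page check

    // Rejects negative thresholds instead of silently treating them like 0.
    public ThresholdBuilderSketch withDictionaryCheckAfterBytes(long val) {
        if (val < 0) {
            throw new IllegalArgumentException(
                "Invalid raw byte threshold for dictionary check (negative): " + val);
        }
        this.dictionaryCheckAfterBytes = val;
        return this;
    }

    public long getDictionaryCheckAfterBytes() {
        return dictionaryCheckAfterBytes;
    }

    public static void main(String[] args) {
        System.out.println(new ThresholdBuilderSketch()
            .withDictionaryCheckAfterBytes(1024L * 1024)
            .getDictionaryCheckAfterBytes());
        try {
            new ThresholdBuilderSketch().withDictionaryCheckAfterBytes(-1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```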

wgtmac · 2026-06-05T05:56:41Z

  @Override
  public void writeByte(int value) {
    rawDataByteSize += 1;
+    cumulativeRawBytes += 1;


Should we use cumulativeRawBytes + rawDataByteSize for checking and only increase cumulativeRawBytes once a page to save some cycles?
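One way to read this suggestion, as a sketch rather than the PR's code: keep the per-value path to a single increment, test the gate against cumulativeRawBytes + rawDataByteSize, and fold the page counter into the cumulative one once per page. The class and the onPageFlush name are assumptions for illustration.

```java
public class PerPageAccumulationSketch {
    private final long threshold;
    private long rawDataByteSize;    // updated for every value
    private long cumulativeRawBytes; // updated once per page

    PerPageAccumulationSketch(long threshold) {
        this.threshold = threshold;
    }

    // Hot path: a single counter update per value.
    void writeByte() {
        rawDataByteSize += 1;
    }

    // Page boundary: evaluate the gate, then fold the page into the running total.
    boolean onPageFlush() {
        boolean reached = cumulativeRawBytes + rawDataByteSize >= threshold;
        cumulativeRawBytes += rawDataByteSize;
        rawDataByteSize = 0;
        return reached;
    }

    public static void main(String[] args) {
        PerPageAccumulationSketch w = new PerPageAccumulationSketch(250);
        for (int page = 1; page <= 3; page++) {
            for (int i = 0; i < 100; i++) {
                w.writeByte();
            }
            System.out.println("page " + page + " reached threshold: " + w.onPageFlush());
        }
    }
}
```

The gate sees the same total as a per-value cumulative counter would, but the second increment moves out of the per-value path.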

wgtmac · 2026-06-05T06:39:12Z

  public static final String BLOCK_ROW_COUNT_LIMIT = "parquet.block.row.count.limit";
  public static final String PAGE_ROW_COUNT_LIMIT = "parquet.page.row.count.limit";
  public static final String PAGE_WRITE_CHECKSUM_ENABLED = "parquet.page.write-checksum.enabled";
+  public static final String DICTIONARY_CHECK_AFTER_BYTES = "parquet.dictionary.check.after.raw.bytes";


TBH, the flag name still looks a little bit unclear to me. How about renaming it to parquet.dictionary.check.threshold.raw.size.bytes and also changing the associated variable and function names?

gszadovszky · 2026-06-05T07:37:51Z

I don't think that integration tests validating the result files' metadata should be a concern. Parquet-java should have the freedom to change its default behavior related to encoding, page limits, etc., as long as it stays within the specification and the configured properties. These are not breaking changes.

I think we should make our changes as automatic and configuration-free as possible. Introducing a new configuration while keeping the current default behavior would lead to nobody actually using it. I'm OK with having the default value so that the default behavior is unchanged until we benchmark the system in prod and can come up with better defaults.

WDYT?

yadavay-amzn · 2026-06-05T17:57:13Z

@wgtmac Addressed all comments:

Applied code suggestions (named arg comments, field grouping, Javadoc on cumulativeRawBytes)
Renamed config to parquet.dictionary.check.threshold.raw.size.bytes
Performance: accumulate cumulativeRawBytes once per page (in getBytes()) instead of per value
Added multi-page test proving threshold is crossed after reset()
Javadoc fixed ("0 = check on first page")
Negative value validation added

Thanks for the thorough review!

wgtmac

Thanks for addressing the feedback! Left some nits.

yadavay-amzn · 2026-06-08T22:15:21Z

@wgtmac Fixed the nits. Thanks again!

Copilot

Pull request overview

This PR introduces a configurable raw-byte threshold controlling when FallbackValuesWriter performs the dictionary-vs-plain compression effectiveness check, preventing premature dictionary fallback for moderate-cardinality columns under modern (~20k row) page defaults.

Changes:

Add dictionaryCheckThresholdRawSizeBytes to ParquetProperties (plus builder + public accessors) and expose it via ParquetWriter.Builder and ParquetOutputFormat configuration.
Update FallbackValuesWriter to delay the isCompressionSatisfying(...) decision until the configured cumulative raw-bytes threshold is reached (and run it once per dictionary reset).
Add unit and integration tests to validate dictionary preservation with a large threshold and fallback behavior with threshold 0.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File	Description
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDictionaryEarlyCheck.java	Integration test asserting file encodings differ based on the configured threshold.
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java	Exposes threshold setting on the public writer builder API.
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java	Adds config key + getter/setter and wires config into writer construction.
parquet-column/src/test/java/org/apache/parquet/column/values/fallback/TestFallbackValuesWriter.java	Unit tests validating fallback vs preservation and the cross-page threshold behavior.
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java	Implements threshold-based delayed compression check with cumulative raw-byte tracking.
parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java	Passes the configured threshold into `FallbackValuesWriter`.
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java	Introduces the new property, default, accessor, and builder method.

+    if (cumulativeRawBytes + rawDataByteSize >= 0) {
+      cumulativeRawBytes += rawDataByteSize;
+    }


+  /**
+   * Returns the byte threshold after which the dictionary compression check is performed.
+   * A value of 0 means check on the first page. Higher values delay the check until that
+   * many raw bytes have been accumulated across pages.
+   *
+   * @return the byte threshold for the dictionary compression check
+   */
+  public long getDictionaryCheckThresholdRawSizeBytes() {
+    return dictionaryCheckThresholdRawSizeBytes;
+  }


+  public FallbackValuesWriter(I initialWriter, F fallBackWriter, long dictionaryCheckThresholdRawSizeBytes) {
    super();
    this.initialWriter = initialWriter;
    this.fallBackWriter = fallBackWriter;
    this.currentWriter = initialWriter;
+    this.dictionaryCheckThresholdRawSizeBytes = dictionaryCheckThresholdRawSizeBytes;
  }


wgtmac · 2026-06-11T08:27:07Z

+    if (!fellBackAlready && !compressionChecked && cumulativeRawBytes >= dictionaryCheckThresholdRawSizeBytes) {
+      compressionChecked = true;
      BytesInput bytes = initialWriter.getBytes();
      if (!initialWriter.isCompressionSatisfying(rawDataByteSize, bytes.size())) {


I think the threshold gate now accumulates raw bytes across pages, but the compression check still compares the current page raw size against the current page encoded size plus the whole accumulated dictionary size. This seems biased toward fallback once the check is delayed beyond the first page. Should this comparison use cumulative raw/encoded sizes, or otherwise avoid charging the full column-chunk dictionary size to a single page? A finite-threshold test with repeated/moderate-cardinality values across pages would help verify the intended PARQUET-3479 case.

yadavay-amzn · 2026-06-13T02:11:55Z

Thanks @wgtmac. You're right, the trigger became cumulative but the comparison was still per-page against the cumulative dictionary, which biases toward fallback the later the check fires.

I've pushed a fix that tracks cumulativeEncodedBytes alongside cumulativeRawBytes and changes the decision to isCompressionSatisfying(cumulativeRawBytes, cumulativeEncodedBytes), so the column-chunk dictionary is amortized over all the pages it covers rather than charged against a single page. The encoded size is accumulated from the BytesInput already produced each page (no extra getBytes() calls), and both accumulators now use Math.addExact consistently and reset together in resetDictionary().

I added a test (TestFallbackCumulativeBias) that reproduces exactly the case you described: repeated/moderate-cardinality values across multiple pages with a finite threshold that fires on a later page. The arithmetic witness:

Before (per-page): encoded(93) + dict(400) = 493 >= pageRaw(400) -> fallback
After (cumulative): totalEncoded(173) + dict(400) = 573 < totalRaw(800) -> dictionary kept

It fails on the previous commit and passes now. I also added an adversarial test confirming a genuinely high-cardinality column still correctly falls back (the fix amortizes the dictionary cost, it doesn't disable fallback), plus a resetDictionary isolation test so one chunk's history doesn't leak into the next.
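The witness above can be replayed with a few lines of arithmetic. This sketch assumes the decision rule is encoded + dictionary < raw, as the two witness lines imply; the method is a stand-in that shares only its name with isCompressionSatisfying, not the real implementation.

```java
public class CumulativeBiasWitness {
    // Stand-in for the decision rule: the dictionary pays off only if it beats raw size.
    static boolean isCompressionSatisfying(long rawBytes, long encodedBytes, long dictionaryBytes) {
        return encodedBytes + dictionaryBytes < rawBytes;
    }

    public static void main(String[] args) {
        long dictionaryBytes = 400;
        long pageRaw = 400, pageEncoded = 93;    // the page on which the check fires
        long totalRaw = 800, totalEncoded = 173; // all pages of the chunk so far

        // Per-page comparison: the whole dictionary is charged to one page.
        System.out.println("per-page keeps dictionary: "
            + isCompressionSatisfying(pageRaw, pageEncoded, dictionaryBytes));
        // Cumulative comparison: the dictionary is amortized over every page it covers.
        System.out.println("cumulative keeps dictionary: "
            + isCompressionSatisfying(totalRaw, totalEncoded, dictionaryBytes));
    }
}
```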

yadavay-amzn force-pushed the fix/parquet_3479 branch from ddf1332 to 2609cdc on June 3, 2026 01:38

yadavay-amzn force-pushed the fix/parquet_3479 branch from 2609cdc to 287f352 on June 3, 2026 05:31

yadavay-amzn force-pushed the fix/parquet_3479 branch from 287f352 to c7cf83e on June 4, 2026 04:49

yadavay-amzn force-pushed the fix/parquet_3479 branch from c7cf83e to 3fcc6c8 on June 4, 2026 21:26

wgtmac reviewed Jun 5, 2026

Comment thread ...uet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java Outdated

wgtmac reviewed Jun 5, 2026

yadavay-amzn force-pushed the fix/parquet_3479 branch from 5aa608f to fefdf41 on June 5, 2026 17:34

PARQUET-3479: Add configurable threshold to delay dictionary compression check

2be0d8b

yadavay-amzn force-pushed the fix/parquet_3479 branch from fefdf41 to 2be0d8b on June 5, 2026 17:39

wgtmac reviewed Jun 8, 2026

yadavay-amzn force-pushed the fix/parquet_3479 branch 2 times, most recently from 5c55627 to fa33930 on June 8, 2026 22:05

wgtmac requested a review from Copilot June 9, 2026 05:24

Copilot started reviewing on behalf of wgtmac June 9, 2026 05:25

Copilot AI reviewed Jun 9, 2026

wgtmac reviewed Jun 9, 2026

Comment thread ...uet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java Outdated

Apply suggestion from @wgtmac

a04c28a

Co-authored-by: Gang Wu <ustcwg@gmail.com>

yadavay-amzn force-pushed the fix/parquet_3479 branch from fa33930 to a04c28a on June 9, 2026 18:45

wgtmac reviewed Jun 11, 2026

PARQUET-3479: Compare cumulative raw/encoded sizes for delayed dictionary check

1982cec


4 participants
