Arrow: Fix truncation of decimals with precision larger than 18 #16627
Open
wombatu-kun wants to merge 1 commit into
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Reading a decimal column through the vectorized Arrow reader silently corrupts values whose unscaled magnitude exceeds `Long.MAX_VALUE`. This affects any decimal with precision larger than 18 (for example `decimal(38, 0)`) holding a sufficiently large value. No error is raised; the returned `BigDecimal` is simply wrong, and often negative.
Root cause
Decimals with precision larger than 18 are stored as a binary / `FIXED_LEN_BYTE_ARRAY` and read into a `FixedSizeBinaryVector`. The binary-backed decimal accessors decode the bytes into the correct `BigDecimal` and then hand it to `JavaDecimalFactory.ofBigDecimal`, which rebuilds it as `BigDecimal.valueOf(value.unscaledValue().longValue(), scale)`. `BigInteger.longValue()` keeps only the low 64 bits, so any unscaled value beyond `Long` range is truncated. The incoming `value` is already the correct `BigDecimal` (it carries the right unscaled value and scale), so this round-trip is both unnecessary and lossy.
The `ofLong` path used for INT32/INT64-backed decimals (precision up to 18) is unaffected, which is why only high-precision decimals are corrupted and the existing tests, which use `decimal(9, 2)`, never caught it.
Fix
Return `value` unchanged. It already represents the decimal with the correct unscaled value and scale, matching how the Spark accessor factory preserves the full value.
Tests
Added `TestArrowReader.testHighPrecisionDecimalIsReadCorrectly`, which writes a `decimal(38, 0)` Parquet file with values larger than `Long.MAX_VALUE` and asserts they round-trip through the vectorized reader. It fails before the fix (expected 12345678901234567890 but was -6101065172474983726) and passes after.
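For reviewers who want to see the truncation without a Parquet file, the following is a minimal standalone sketch using only `java.math`. The class `DecimalTruncationDemo` and its methods `decode`, `lossy`, and `preserved` are hypothetical names invented for illustration; they are not the actual Iceberg accessor or factory code. They only mirror the behavior described above, using the value from the failing test.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical illustration only; not the actual Iceberg classes.
class DecimalTruncationDemo {

  // Assumed decoding step: big-endian two's-complement bytes plus a scale.
  static BigDecimal decode(byte[] bytes, int scale) {
    return new BigDecimal(new BigInteger(bytes), scale);
  }

  // Mirrors the lossy round-trip: longValue() keeps only the low 64 bits.
  static BigDecimal lossy(BigDecimal value) {
    return BigDecimal.valueOf(value.unscaledValue().longValue(), value.scale());
  }

  // Mirrors the fix: the decoded value is already correct, so return it as is.
  static BigDecimal preserved(BigDecimal value) {
    return value;
  }

  public static void main(String[] args) {
    // A decimal(38, 0) value whose unscaled magnitude exceeds Long.MAX_VALUE.
    BigInteger unscaled = new BigInteger("12345678901234567890");
    BigDecimal decoded = decode(unscaled.toByteArray(), 0);

    System.out.println(decoded);            // 12345678901234567890
    System.out.println(lossy(decoded));     // -6101065172474983726
    System.out.println(preserved(decoded)); // 12345678901234567890

    // A decimal(9, 2) value fits in a long, so the lossy path happens to work.
    BigDecimal small = new BigDecimal("123.45");
    System.out.println(lossy(small));       // 123.45
  }
}
```

The wrong result equals 12345678901234567890 minus 2^64, which is what keeping only the low 64 bits produces. The last case suggests why tests limited to `decimal(9, 2)` could not expose the problem.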