[SPARK-4176] [SQL] Supports decimal types with precision > 18 in Parquet #7455
Conversation
Test build #37552 has finished for PR 7455 at commit
This should fail when `followParquetFormatSpec = false`, shouldn't it?
I was trying to test this with `withSQLConf` but couldn't get it to work: https://github.com/apache/spark/pull/6796/files#diff-82fab6131b7092c5faa4064fd04c3d72R135
(I still have to find out why I can't run tests locally; `./build/sbt sql/test` fails with a compiler assertion?!)
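For reference, here is a minimal sketch of the `withSQLConf` pattern as it is typically used in Spark's SQL test suites. The suite name, test body, and exact scaffolding traits are illustrative assumptions (they vary across Spark versions), not code from this PR:

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSQLContext

// Illustrative sketch only: assumes the suite is wired to a SQLContext the way
// Spark's own Parquet suites are (e.g. via a shared-context trait).
class ParquetDecimalQuerySuite extends QueryTest with SharedSQLContext {
  test("round-trip a decimal with precision > 18") {
    // withSQLConf sets the given conf only for the enclosed block and restores
    // the previous value afterwards, so the setting cannot leak across tests.
    withSQLConf("spark.sql.parquet.followParquetFormatSpec" -> "true") {
      withTempPath { dir =>
        val df = sqlContext.range(10).selectExpr("CAST(id AS DECIMAL(25, 5)) AS dec")
        df.write.parquet(dir.getCanonicalPath)
        checkAnswer(sqlContext.read.parquet(dir.getCanonicalPath), df.collect())
      }
    }
  }
}
```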
Why should it fail? Decimals with large precisions are available whether `followParquetFormatSpec` is true or false, right? As for your local test failure, I guess a clean build would probably solve the problem.
Note that, as stated in the PR description, this PR doesn't support writing decimals when `followParquetFormatSpec` is true, because that doesn't make sense until the whole Parquet write path is refactored to conform to the Parquet format spec.
Ah, yes, you removed the `< 8` check too. But shouldn't `followParquetFormatSpec = false` generate compatible files?
I'm still getting a compiler assertion on test compile even after cleaning :-S Anyway, it must be some sort of local ******.
Hm, good question... Maybe we should just disable large decimal precisions in compatible mode? However, if we do that, this PR will only be able to read decimals with large precisions, not write them. I'll probably refactor the Parquet write path for the Parquet format spec in 1.5 and add proper decimal writing support then, but that's not a promise yet, since it's been assigned a relatively low priority.
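For concreteness, a hypothetical usage sketch of the flag under discussion (the conf key matches the `followParquetFormatSpec` option named above; the `sqlContext`, `df`, and paths are placeholders, not code from this PR):

```scala
// Hypothetical sketch only: assumes an existing SQLContext `sqlContext` and a
// DataFrame `df` containing a DECIMAL(25, 5) column.

// Compatible mode: keep the legacy, non-standard layout older readers expect.
sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "false")
df.write.parquet("/tmp/decimals-compatible")

// Spec mode: follow the Parquet format spec where the write path supports it.
sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")
df.write.parquet("/tmp/decimals-spec")
```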
I would prefer the way you currently wrote it.
I don't see the point of keeping a "store it in a way that an older version can read" flag. You should always try out a new version before using it for real storage, and reading files written by an old Spark version will always be possible.
PS: solved the test thing. It looks like Spark's sbt somehow managed to pick up a local Scala 2.9.6 compiler 0.o
One scenario is this:
1. You were using some old Spark version for writing Parquet files.
2. You developed some downstream tools to process those Parquet files.
3. Then you upgraded to Spark 1.5.
The Parquet format spec is relatively new and few tools/systems implement it, so it's quite possible that the tools mentioned in step 2 are bound to the legacy, non-standard Parquet format the older Spark version adopted. If we didn't provide a compatible mode, those tools would break and have to be rewritten.
The reason I added large decimal precision support to compatible mode is that it adds an ability older versions don't have without breaking anything existing. I guess keeping the current behavior is OK.
Sounds like a changelog entry to me:
(We can now write Parquet files for decimals with precision > 18, so please check compatibility if you consume Spark-written Parquet files elsewhere.)
Test build #37620 has finished for PR 7455 at commit
Waiting for #7441 to be merged. Need to rebase against it.
#7441 was merged. Rebased this PR.
Test build #38534 has finished for PR 7455 at commit
|
|
@liancheng thank you for your time reviewing and implementing this! 🙇
… for precision <= 18 rather than 8

This PR fixes a minor bug introduced in #7455: when writing decimals, we should use the unscaled Long for better performance when the precision is <= 18 rather than 8 (the 8 appears to be a typo). This bug doesn't affect correctness, but it hurts Parquet decimal writing performance. This PR also replaces similar magic numbers with newly defined constants.

Author: Cheng Lian <[email protected]>

Closes #8031 from liancheng/spark-4176/minor-fix-for-writing-decimals and squashes the following commits:

10d4ea3 [Cheng Lian] Should use unscaled Long to write decimals for precision <= 18 rather than 8
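To make the fix concrete, here is a hedged sketch of the dispatch the commit message describes; it is not Spark's actual `RowWriteSupport` code, and the object, method, and stub names are made up:

```scala
import java.math.BigInteger

// A sketch of the precision-based dispatch, not Spark's actual code. 18 is the
// largest precision whose unscaled values always fit in a signed 64-bit Long
// (Long.MaxValue = 9223372036854775807 has 19 digits), so MaxLongDigits below
// plays the role of the named constant the commit introduces.
object DecimalWriteDispatchSketch {
  val MaxLongDigits = 18

  def writeDecimal(unscaled: BigInteger, precision: Int): Unit =
    if (precision <= MaxLongDigits) {
      // Fast path: the unscaled value fits in a Long, so no byte-array
      // allocation from BigInteger is needed.
      writeUnscaledLong(unscaled.longValueExact())
    } else {
      writeFixedLenBytes(unscaled.toByteArray)
    }

  // Stubs standing in for the real Parquet record-consumer calls.
  private def writeUnscaledLong(v: Long): Unit =
    println(s"long fast path: $v")
  private def writeFixedLenBytes(bytes: Array[Byte]): Unit =
    println(s"byte-array path: ${bytes.length} bytes")

  def main(args: Array[String]): Unit = {
    writeDecimal(new BigInteger("123456789012345678"), 18)        // long fast path
    writeDecimal(new BigInteger("1234567890123456789012345"), 25) // byte-array path
  }
}
```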
This PR is based on #6796 authored by @rtreffer.
To support large decimal precisions (> 18), we do the following things in this PR:

1. Making `CatalystSchemaConverter` support large decimal precision

   Decimal types with large precision are always converted to fixed-length byte arrays (a self-contained sketch of this encoding follows the list below).

2. Making `CatalystRowConverter` support reading decimal values with large precision

   When the precision is > 18, constructs `Decimal` values with an unscaled `BigInteger` rather than an unscaled `Long`.

3. Making `RowWriteSupport` support writing decimal values with large precision

   In this PR we always write decimals as fixed-length byte arrays, because the Parquet write path hasn't been refactored to conform to the Parquet format spec yet (see SPARK-6774 & SPARK-8848).

Two follow-up tasks should be done in future PRs:

- Writing decimals as `INT32`, `INT64` when possible while fixing SPARK-8848
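As noted in item 1, large-precision decimals are stored as fixed-length byte arrays. Below is a small, self-contained sketch of that encoding under the usual Parquet convention (the unscaled value as a big-endian two's-complement integer, sign-extended to the minimal width for the declared precision); it is illustrative only, not Spark's actual converter code:

```scala
import java.math.BigInteger

object FixedLenDecimalSketch {
  // Minimum number of bytes whose signed two's-complement range can hold any
  // unscaled value of the given precision (same idea as Spark's
  // minBytesForPrecision helper; this is a naive reimplementation).
  def minBytesForPrecision(precision: Int): Int =
    Iterator.from(1).find { numBytes =>
      Math.floor(Math.log10(Math.pow(2, 8 * numBytes - 1) - 1)) >= precision
    }.get

  // Encode: minimal two's-complement bytes from BigInteger, then sign-extend
  // on the left to exactly numBytes (0x00 for non-negative, 0xFF for negative).
  def toFixedLenBytes(unscaled: BigInteger, numBytes: Int): Array[Byte] = {
    val raw = unscaled.toByteArray // big-endian two's-complement, minimal length
    require(raw.length <= numBytes, "value does not fit the declared precision")
    val padding: Byte = if (unscaled.signum < 0) -1 else 0
    Array.fill[Byte](numBytes - raw.length)(padding) ++ raw
  }

  // Decode: BigInteger's byte-array constructor reads two's-complement directly.
  def fromFixedLenBytes(bytes: Array[Byte]): BigInteger = new BigInteger(bytes)

  def main(args: Array[String]): Unit = {
    val unscaled = new BigInteger("-12345678901234567890123") // 23 digits > 18
    val numBytes = minBytesForPrecision(23)                   // 10 bytes
    val encoded  = toFixedLenBytes(unscaled, numBytes)
    assert(fromFixedLenBytes(encoded) == unscaled)
    println(s"precision 23 -> $numBytes-byte FIXED_LEN_BYTE_ARRAY, round-trip OK")
  }
}
```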