Conversation

@liancheng
Contributor

This PR is based on #6796 authored by @rtreffer.

To support large decimal precisions (> 18), this PR does the following:

  1. Making CatalystSchemaConverter support large decimal precision

    Decimal types with large precision are always converted to a fixed-length byte array (see the sketch below).

  2. Making CatalystRowConverter support reading decimal values with large precision

    When the precision is > 18, the converter constructs Decimal values from an unscaled BigInteger rather than an unscaled Long, since 18 digits is the largest precision guaranteed to fit in a signed 64-bit Long.

  3. Making RowWriteSupport support writing decimal values with large precision

    In this PR we always write decimals as fixed-length byte arrays, because the Parquet write path hasn't yet been refactored to conform to the Parquet format spec (see SPARK-6774 & SPARK-8848).

Two follow-up tasks should be done in future PRs:

  • Writing decimals as INT32 or INT64 when possible, while fixing SPARK-8848
  • Adding compatibility tests as part of SPARK-5463
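
To make the fixed-length representation concrete: an unscaled value with more than 18 digits may no longer fit in a signed 64-bit Long, so it is stored as a sign-extended, big-endian two's-complement byte array whose width is determined by the declared precision. Below is a minimal standalone sketch of that encoding; it is not Spark's actual CatalystSchemaConverter / RowWriteSupport code, and the object and method names are invented for illustration:

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

object DecimalEncodingSketch {
  // Smallest number of bytes whose signed two's-complement range covers
  // every unscaled value of the given decimal precision. For example,
  // precision 18 needs 8 bytes (10^18 - 1 < 2^63 - 1), while precision 19
  // already needs 9.
  def minBytesForPrecision(precision: Int): Int =
    Iterator.from(1).find { numBytes =>
      val maxRepresentable =
        BigInteger.ONE.shiftLeft(8 * numBytes - 1).subtract(BigInteger.ONE)
      val maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE)
      maxRepresentable.compareTo(maxUnscaled) >= 0
    }.get

  // Encode the unscaled value as a big-endian two's-complement array,
  // sign-extended to exactly `numBytes` bytes (the layout used for
  // FIXED_LEN_BYTE_ARRAY decimals).
  def toFixedBytes(decimal: JBigDecimal, numBytes: Int): Array[Byte] = {
    val minimal = decimal.unscaledValue.toByteArray // big-endian, minimal length
    require(minimal.length <= numBytes,
      s"value needs ${minimal.length} bytes, only $numBytes available")
    val padByte: Byte = if (decimal.signum < 0) -1 else 0 // 0xFF pads negatives
    val fixed = Array.fill[Byte](numBytes)(padByte)
    System.arraycopy(minimal, 0, fixed, numBytes - minimal.length, minimal.length)
    fixed
  }

  // Decode back: BigInteger's byte-array constructor already interprets its
  // input as big-endian two's complement, so the padding bytes are harmless.
  def fromFixedBytes(bytes: Array[Byte], scale: Int): JBigDecimal =
    new JBigDecimal(new BigInteger(bytes), scale)

  def main(args: Array[String]): Unit = {
    val value = new JBigDecimal("12345678901234567890123.45") // precision 25, scale 2
    val numBytes = minBytesForPrecision(25)
    val roundTripped = fromFixedBytes(toFixedBytes(value, numBytes), 2)
    assert(roundTripped == value)
    println(s"precision 25 -> $numBytes bytes, round-trip OK")
  }
}
```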

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37552 has finished for PR 7455 at commit 7dd88ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This should fail if followParquetFormatSpec = false.

I was trying to test "withSQLConf" but couldn't get it to work: https://github.com/apache/spark/pull/6796/files#diff-82fab6131b7092c5faa4064fd04c3d72R135

(I have to find out why I can't run tests locally, ./build/sbt sql/test fails with a compiler assertion?!?)
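
(For reference, the withSQLConf pattern in question looks roughly like the fragment below in Spark's SQL test suites. This is a hedged sketch: it assumes a suite extending QueryTest with SQLTestUtils, which supply test, withSQLConf, withTempPath, checkAnswer, and sqlContext; the config key spelling is the one from that era, and the test name, decimal type, and data are invented.)

```scala
// Fragment for a suite mixing in SQLTestUtils / QueryTest; round-trips a
// high-precision decimal column with the flag under discussion disabled.
test("round-trip decimals with precision > 18") {
  withSQLConf("spark.sql.parquet.followParquetFormatSpec" -> "false") {
    withTempPath { dir =>
      val df = sqlContext.range(10).selectExpr("CAST(id AS DECIMAL(25, 5)) AS d")
      df.write.parquet(dir.getCanonicalPath)
      checkAnswer(sqlContext.read.parquet(dir.getCanonicalPath), df.collect())
    }
  }
}
```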

Contributor Author

Why should it fail? Decimals with large precisions are available no matter whether followParquetFormatSpec is true or false, right? As for your local test failure, I guess a clean build would probably solve the problem.

Contributor Author

Note that, as stated in the PR description, this PR doesn't support writing decimals when followParquetFormatSpec is true, because that won't make sense until the whole Parquet write path is refactored to conform to the Parquet format spec.

Contributor

Ah, yes, you removed the < 8 check too. But shouldn't followParquetFormatSpec = false generate compatible files?

I'm getting a compiler assertion on test compile and I tried cleaning :-S Anyway, must be some sort of local ******.

Contributor Author

Hm, good question... Maybe we should just disable large decimal precisions in compatible mode? However, if we do that, this PR will only be able to read decimals with large precisions, not write them. I'll probably refactor the Parquet write path to follow the Parquet format spec in 1.5 and add proper decimal writing support then, but that's not a promise yet, since it's assigned a relatively low priority.

Contributor

I would prefer the way you currently wrote it.
I don't see the point in keeping a "store it in a way that an older version can read" flag. You should always try out a new version before using it for real storage, and reading files written by old Spark versions will always be possible.

PS: solved the test issue. It looks like Spark's sbt build somehow managed to use a local Scala 2.9.6 scalac 0.o

Contributor Author

One scenario is this:

  1. You were using some old Spark version to write Parquet files
  2. And you developed some downstream tools to process those Parquet files
  3. Then you upgraded to Spark 1.5

The Parquet format spec is relatively new and few tools/systems have implemented it. So it's quite possible that the tools mentioned in step 2 are bound to the legacy, non-standard Parquet format that older Spark versions adopt. If we don't provide a compatible mode, these tools break and must be rewritten.

The reason I added large decimal precision support to compatible mode is that it adds an extra ability older versions don't have, without breaking anything that already exists. I guess keeping the current behavior is OK.

Contributor

Sounds like a changelog entry to me.
(We can now write Parquet files for decimals with precision > 18, so please check compatibility if you use Spark's Parquet files elsewhere.)

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37620 has finished for PR 7455 at commit 43411f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Waiting for #7441 to be merged; this PR needs to be rebased against it.

@liancheng
Contributor Author

#7441 was merged. Rebased this PR.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38534 has finished for PR 7455 at commit a543d10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in aa19c69 on Jul 27, 2015
@liancheng
Contributor Author

Merged to master. Will update #7679 to include large decimal precision support.

@rtreffer I attributed this PR to you while merging. Thanks for the original contribution!

@liancheng deleted the spark-4176 branch on July 27, 2015 at 15:32
@rtreffer
Contributor

@liancheng thank you for your time reviewing and implementing this! 🙇

asfgit pushed a commit that referenced this pull request Aug 8, 2015
Should use unscaled Long to write decimals for precision <= 18 rather than 8

This PR fixes a minor bug introduced in #7455: when writing decimals, we should use the unscaled Long for better performance when the precision is <= 18, rather than 8 (the 8 was presumably a typo). This bug doesn't affect correctness, but it hurts Parquet decimal writing performance.

This PR also replaced similar magic numbers with newly defined constants.

Author: Cheng Lian <[email protected]>

Closes #8031 from liancheng/spark-4176/minor-fix-for-writing-decimals and squashes the following commits:

10d4ea3 [Cheng Lian] Should use unscaled Long to write decimals for precision <= 18 rather than 8
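
For illustration, the threshold this commit corrects amounts to the split sketched below. This is a hedged sketch with invented names, not Spark's actual Decimal or RowWriteSupport code:

```scala
import java.math.{BigDecimal => JBigDecimal}

object UnscaledWriteSketch {
  // 10^18 - 1 = 999999999999999999 < 2^63 - 1, so any unscaled value with
  // precision <= 18 fits in a signed 64-bit Long. The typo used 8 here,
  // forcing values with 9 to 18 digits through the slower byte-array path.
  val MaxLongDigits = 18 // hypothetical name, not Spark's actual constant

  def encodeUnscaled(precision: Int, d: JBigDecimal): Either[Long, Array[Byte]] =
    if (precision <= MaxLongDigits)
      Left(d.unscaledValue.longValueExact()) // fixed-width fast path
    else
      Right(d.unscaledValue.toByteArray)     // arbitrary-precision path

  def main(args: Array[String]): Unit = {
    // 18 digits still takes the Long path; 25 digits needs the byte array.
    assert(encodeUnscaled(18, new JBigDecimal("999999999999999999")).isLeft)
    assert(encodeUnscaled(25, new JBigDecimal("1234567890123456789012345")).isRight)
  }
}
```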