[SPARK-4386] Improve performance when writing Parquet files. #3254
Conversation
Can one of the admins verify this patch?
ok to test
Good catch, LGTM, thanks!
Test build #23369 has started for PR 3254 at commit
Test build #23369 has finished for PR 3254 at commit
Test PASSed.
This is great, thanks for looking into this! We haven't done much profiling of some of these critical code sections yet. I wonder if there aren't other places where we are being sub-optimal. In general, I wonder whether it isn't a good idea to make sure that in the critical parts we convert to raw arrays. I've merged this to master and 1.2 to make sure this optimization at least makes the next release.
If you profile the writing of a Parquet file, the single most time-consuming call inside org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually the scala.collection.AbstractSequence.size call. This is because the size call ends up COUNTING the elements via scala.collection.LinearSeqOptimized.length ("optimized?").
This doesn't need to be done: currently, "size" is called repeatedly wherever it is needed, rather than being called once at the top of the method and stored in a 'val'.
Author: Jim Carroll <[email protected]>
Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits:
30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.
(cherry picked from commit f76b968)
Signed-off-by: Michael Armbrust <[email protected]>
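To make the description above concrete, here is a minimal sketch of the pattern this commit changes. The names and method shape are hypothetical; this is not the actual Spark code.

```scala
// Hypothetical sketch of the pattern described in this PR, not the actual Spark code.
// When `record` is backed by a List, record.size walks the entire list
// (scala.collection.LinearSeqOptimized.length), so it is O(n) on every call.
object SizeHoistSketch {

  // Before: size is re-evaluated for every column of every row.
  def writeRowBefore(record: Seq[Any])(writeColumn: Int => Unit): Unit = {
    var index = 0
    while (index < record.size) { // counts the elements on each iteration
      writeColumn(index)
      index += 1
    }
  }

  // After: count once per row and keep the result in a val.
  def writeRowAfter(record: Seq[Any])(writeColumn: Int => Unit): Unit = {
    val attributesSize = record.size // counted once
    var index = 0
    while (index < attributesSize) {
      writeColumn(index)
      index += 1
    }
  }
}
```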
Cool. Thanks. I had some pretty poor performance on a 7.5 million row, 500 column data set, so I profiled it. Some of the slowness was my code and some was that 'size' call. The problem was exacerbated by wide datasets, since 'size' was called on a per-column basis for each row and the 'size' call itself is proportionally slower the wider the dataset. I was actually surprised the Scala implementation of that List (which has the word "Optimized" in its name) didn't cache the size the first time it calculated it.
Hi, I am writing a table with 575 columns (about 2/3rds are nulls in each row) and 120M rows, using Spark 1.2 which has this change in. My write rate is about 1K rows per second, dominated by access to attributes(index) in RowWriteSupport. There are some further optimizations with respect to the attributes field that are big wins. The looping code in RowWriteSupport.write makes three calls to attributes(index) per column (see the sketch below), and these are really slow because attributes is a LinearSeqOptimized, for which apply is order N. I tested locally having converted attributes to an array and I now write at 6K rows a second, with time spent now mostly outside Spark. Conversion to an array is not expensive and is done in init. Caching attributesSize is then no longer necessary. I'll create a pull request to make this change after I have done some more profiling.
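Below is a sketch of the loop shape being described, reconstructed from this discussion rather than copied from the Spark source; the Attribute stand-in and the writer callbacks are assumptions made so the example is self-contained.

```scala
// Reconstructed sketch of the hot loop discussed above; not verbatim Spark code.
object RowWriteLoopSketch {
  // Stand-in for the real Catalyst attribute type.
  case class Attribute(name: String, dataType: String)

  def writeRow(
      attributes: Seq[Attribute], // a List in Spark 1.2, so attributes(index) is O(n)
      record: Seq[Any],
      startField: (String, Int) => Unit,
      writeValue: (String, Any) => Unit,
      endField: (String, Int) => Unit): Unit = {
    val attributesSize = attributes.size // cached once per row (the fix in this PR)
    var index = 0
    while (index < attributesSize) {
      if (record(index) != null) {
        startField(attributes(index).name, index)              // linear scan of the list
        writeValue(attributes(index).dataType, record(index))  // linear scan again
        endField(attributes(index).name, index)                // and again
      }
      index += 1
    }
  }
}
```

Converting attributes to an Array once (e.g. in init) turns each of those lookups into a constant-time access, which is where the reported jump from roughly 1K to 6K rows per second comes from.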
@MickDavies Thanks for pinning down the hotspot, looking forward to your PR :)
Thanks @MickDavies. I was JUST about to start profiling it again. This is the same Scala class I originally had issues with.
Just an FYI. I changed line 141 of ParquetTableSupport.scala (spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala, line 141 as of ad42b28) from:
attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
to:
attributes = ParquetTypesConverter.convertFromString(origAttributesStr).toArray[Attribute]
(I also changed the type of attributes to Array[Attribute].) As @MickDavies said, this seems to have a fairly dramatic effect on the performance.
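For readers wondering why a one-line .toArray makes such a difference: indexed access on an immutable Scala List walks the list from the head, while an Array access is a constant-time read. A small, self-contained illustration (illustrative only, not a benchmark of Spark itself):

```scala
object ListVsArrayAccess extends App {
  val columns = 2000
  val asList: List[Int]   = (0 until columns).toList
  val asArray: Array[Int] = asList.toArray

  // List: apply(i) follows i cons cells and length counts every element, so both are O(n).
  val lastFromList = asList(columns - 1)

  // Array: apply(i) is a bounds-checked memory read and length is a stored field, so both are O(1).
  val lastFromArray = asArray(columns - 1)

  println(s"last via List = $lastFromList, last via Array = $lastFromArray")
}
```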
@jimfcarroll - that's exactly the change I made. The performance improvements are very substantial for wide tables; in the case I was looking at, writes were 6x as fast, and the gain is more significant still if you consider just the processing within Spark. Thanks for making the improvement.
@MickDavies, I'm not a Spark committer, so I think they're still looking for a PR. I just wanted to let everyone know your improvement is substantial.
Yeah, would love to see a PR for this :)
@jimfcarroll sorry, I misunderstood your comment. Good that you have verified the performance gain. I have added a PR; it is number 3843.
@MickDavies thanks. I needed the change and was beginning the process of profiling again. Creating a Parquet file from 5.5 million rows and 2000+ columns took me over 15 hours, so I incorporated your change when I saw your description.