[SPARK-4386] Improve performance when writing Parquet files. #3254
Conversation
Can one of the admins verify this patch?
ok to test
Good catch, LGTM, thanks!
Test build #23369 has started for PR 3254 at commit
Test build #23369 has finished for PR 3254 at commit
Test PASSed.
This is great, thanks for looking into this! We haven't done much profiling of some of these critical code sections yet. I wonder if there aren't other places where we are being sub-optimal. In general, I wonder whether it isn't a good idea to make sure that in the critical parts we convert to raw arrays. I've merged this to master and 1.2 to make sure this optimization at least makes the next release.
If you profile the writing of a Parquet file, the single most time-consuming call inside org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually the scala.collection.AbstractSequence.size call. This is because the size call ends up COUNTING the elements via scala.collection.LinearSeqOptimized.length ("optimized?").
This doesn't need to be done: currently, "size" is called repeatedly wherever it is needed, rather than being called once at the top of the method and stored in a 'val'.
Author: Jim Carroll <[email protected]>
Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits:
30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.
(cherry picked from commit f76b968)
Signed-off-by: Michael Armbrust <[email protected]>
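To make the description above concrete, here is a minimal sketch of the pattern this commit changes. The names and method shape are hypothetical; this is not the actual Spark code.

```scala
// Hypothetical sketch of the pattern described in this PR, not the actual Spark code.
// When `record` is backed by a List, record.size walks the entire list
// (scala.collection.LinearSeqOptimized.length), so it is O(n) on every call.
object SizeHoistSketch {

  // Before: size is re-evaluated for every column of every row.
  def writeRowBefore(record: Seq[Any])(writeColumn: Int => Unit): Unit = {
    var index = 0
    while (index < record.size) { // counts the elements on each iteration
      writeColumn(index)
      index += 1
    }
  }

  // After: count once per row and keep the result in a val.
  def writeRowAfter(record: Seq[Any])(writeColumn: Int => Unit): Unit = {
    val attributesSize = record.size // counted once
    var index = 0
    while (index < attributesSize) {
      writeColumn(index)
      index += 1
    }
  }
}
```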
Cool. Thanks. I had some pretty poor performance on a 7.5 million row, 500 column data set, so I profiled it. Some of the slowness was my code and some was that 'size' call. The problem was exacerbated by wide datasets, since 'size' was called on a per-column basis for each row and the 'size' call itself is proportionally slower the wider the dataset. I was actually surprised the Scala implementation of that List (which has the word "Optimized" in its name) didn't cache the size the first time it calculated it.
Hi, I am writing a table with 575 columns (about 2/3rds are nulls in each row) and 120M rows, using Spark 1.2 which has this change in. My write rate is about 1K rows per second, dominated by access to attributes(index) in RowWriteSupport. There are some further optimizations with respect to the attributes field that are big wins. The looping code in RowWriteSupport.write makes three calls to attributes(index) per column (see the sketch below), and these are really slow because attributes is a LinearSeqOptimized, for which apply is order N. I tested locally having converted attributes to an array and I now write at 6K rows a second, with time spent now mostly outside Spark. Conversion to an array is not expensive and is done in init. Caching attributesSize is then no longer necessary. I'll create a pull request to make this change after I have done some more profiling.
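Below is a sketch of the loop shape being described, reconstructed from this discussion rather than copied from the Spark source; the Attribute stand-in and the writer callbacks are assumptions made so the example is self-contained.

```scala
// Reconstructed sketch of the hot loop discussed above; not verbatim Spark code.
object RowWriteLoopSketch {
  // Stand-in for the real Catalyst attribute type.
  case class Attribute(name: String, dataType: String)

  def writeRow(
      attributes: Seq[Attribute], // a List in Spark 1.2, so attributes(index) is O(n)
      record: Seq[Any],
      startField: (String, Int) => Unit,
      writeValue: (String, Any) => Unit,
      endField: (String, Int) => Unit): Unit = {
    val attributesSize = attributes.size // cached once per row (the fix in this PR)
    var index = 0
    while (index < attributesSize) {
      if (record(index) != null) {
        startField(attributes(index).name, index)              // linear scan of the list
        writeValue(attributes(index).dataType, record(index))  // linear scan again
        endField(attributes(index).name, index)                // and again
      }
      index += 1
    }
  }
}
```

Converting attributes to an Array once (e.g. in init) turns each of those lookups into a constant-time access, which is where the reported jump from roughly 1K to 6K rows per second comes from.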
@MickDavies Thanks for pinning down the hotspot, looking forward to your PR :)
Thanks @MickDavies. I was JUST about to start profiling it again. This is the same Scala class I originally had issues with.
Just an FYI. I changed line 141 of ParquetTableSupport.scala (spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala, line 141 as of ad42b28) from:
attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
to:
attributes = ParquetTypesConverter.convertFromString(origAttributesStr).toArray[Attribute]
(I also changed the type of attributes to Array[Attribute].) As @MickDavies said, this seems to have a fairly dramatic effect on the performance.
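For readers wondering why a one-line .toArray makes such a difference: indexed access on an immutable Scala List walks the list from the head, while an Array access is a constant-time read. A small, self-contained illustration (illustrative only, not a benchmark of Spark itself):

```scala
object ListVsArrayAccess extends App {
  val columns = 2000
  val asList: List[Int]   = (0 until columns).toList
  val asArray: Array[Int] = asList.toArray

  // List: apply(i) follows i cons cells and length counts every element, so both are O(n).
  val lastFromList = asList(columns - 1)

  // Array: apply(i) is a bounds-checked memory read and length is a stored field, so both are O(1).
  val lastFromArray = asArray(columns - 1)

  println(s"last via List = $lastFromList, last via Array = $lastFromArray")
}
```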
@jimfcarroll - that's exactly the change I made. The performance improvements are very substantial for wide tables; in the case I was looking at, writes were 6x as fast, and the gain is more significant still if you consider just the processing within Spark. Thanks for making the improvement.
@MickDavies, I'm not a Spark committer, so I think they're still looking for a PR. I just wanted to let everyone know your improvement is substantial.
Yeah, would love to see a PR for this :)
@jimfcarroll sorry, I misunderstood your comment. Good that you have verified the performance gain. I have added a PR; it is number 3843.
@MickDavies thanks. I needed the change and was beginning the process of profiling again. Creating a Parquet file from 5.5 million rows and 2000+ columns took me over 15 hours, so I incorporated your change when I saw your description.