
Conversation

@jimfcarroll

If you profile the writing of a Parquet file, the single most time-consuming call inside org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually the scala.collection.AbstractSequence.size call. This is because size ends up COUNTING the elements via scala.collection.LinearSeqOptimized.length ("optimized?").

This doesn't need to be done: "size" is currently called repeatedly wherever it's needed, rather than being called once at the top of the method and stored in a 'val'.
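For illustration, a minimal standalone sketch of the pattern (hypothetical names, not the actual Spark code), contrasting a per-iteration size call with a size hoisted into a val:

object SizeHoistSketch {
  // Slow: row.size walks the whole List on every iteration (O(n) per check),
  // so the loop is O(n^2) in the number of columns.
  def writeRowSlow(row: List[Any]): Unit = {
    var i = 0
    while (i < row.size) {
      // ... write column i ...
      i += 1
    }
  }

  // Fast: compute the length once per row and reuse it.
  def writeRowFast(row: List[Any]): Unit = {
    val size = row.size
    var i = 0
    while (i < size) {
      // ... write column i ...
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val row = List.fill(500)("x")
    writeRowSlow(row)
    writeRowFast(row)
  }
}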

@AmplabJenkins

Can one of the admins verify this patch?

@liancheng
Contributor

ok to test

@liancheng
Contributor

Good catch, LGTM, thanks!

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23369 has started for PR 3254 at commit 30cc0b5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23369 has finished for PR 3254 at commit 30cc0b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23369/

@marmbrus
Contributor

This is great, thanks for looking into this! We haven't done much profiling of some of these critical code sections yet. I wonder if there aren't other places where we are being sub-optimal.

In general, I wonder if it isn't a good idea to make sure that in the critical parts we convert to raw Arrays, which have constant-time length and lookup (and, if I understand correctly, also avoid function call overhead for both).

I've merged to master and 1.2 to make sure this optimization at least makes the next release.
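For a rough standalone illustration of that point (hypothetical code, not from Spark): length and apply are constant time on an Array but linear on a List, so an index-based hot loop degrades badly on a List:

object ArrayVsListSketch {
  // For a List, xs.length and xs(i) both walk the list, so this loop is O(n^2).
  def sumList(xs: List[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) { total += xs(i); i += 1 }
    total
  }

  // For an Array, length and apply are constant time, so this loop is O(n).
  def sumArray(xs: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) { total += xs(i); i += 1 }
    total
  }

  def main(args: Array[String]): Unit = {
    val asList = (1 to 20000).toList
    val asArray = asList.toArray

    def time(label: String)(body: => Long): Unit = {
      val start = System.nanoTime()
      body
      println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    }

    time("List")(sumList(asList))    // noticeably slower as n grows
    time("Array")(sumArray(asArray))
  }
}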

asfgit pushed a commit that referenced this pull request Nov 15, 2014
If you profile the writing of a Parquet file, the single most time-consuming call inside org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually the scala.collection.AbstractSequence.size call. This is because size ends up COUNTING the elements via scala.collection.LinearSeqOptimized.length ("optimized?").

This doesn't need to be done: "size" is currently called repeatedly wherever it's needed, rather than being called once at the top of the method and stored in a 'val'.

Author: Jim Carroll <[email protected]>

Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits:

30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.

(cherry picked from commit f76b968)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit closed this in f76b968 Nov 15, 2014
@jimfcarroll
Author

Cool. Thanks. I had some pretty poor performance on a 7.5 million row, 500-column data set, so I profiled it. Some of the slowness was my code and some was that 'size' call.

The problem was exacerbated by wide datasets, since 'size' was called on a per-column basis for each row, and the 'size' call itself is proportionally slower the wider the dataset.

I was actually surprised the Scala implementation of that List (which has the word "optimized" in its name) didn't cache the size the first time it calculated it.
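For rough scale (a back-of-the-envelope estimate, not a measurement): if size is linear in the number of columns and is called once per column for every row, a 500-column row costs on the order of 500 × 500 = 250,000 list-node traversals just to recompute lengths, and across 7.5 million rows that is roughly 1.9 × 10^12 traversals before any actual writing happens.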

@MickDavies
Contributor

Hi,

I am writing a table with 575 columns (about two-thirds are null in each row) and 120M rows, using Spark 1.2, which includes this change.

My write rate is about 1K rows per second, dominated by access to attributes(index) in RowWriteSupport. There are some further optimizations with respect to the attributes field that are big wins.

In the looping code in RowWriteSupport.write you have:

  if (record(index) != null) {
    writer.startField(attributes(index).name, index)
    writeValue(attributes(index).dataType, record(index))
    writer.endField(attributes(index).name, index)
  }

The 3 calls to attributes(index) are really slow because attributes is a LinearSeqOptimized and apply is O(n). I tested locally, having converted attributes to an array, and I now write at 6K rows a second, with the time now spent mostly outside Spark. The conversion to an array is not expensive and is done in init. Caching attributeSize is then no longer necessary.

I'll create a pull request to make this change after I have done some more profiling.
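A standalone sketch of the change being described (hypothetical names; the actual fix converts the Seq returned by ParquetTypesConverter.convertFromString to an array in init, as the later comments show):

object AttributesAsArraySketch {
  case class Attribute(name: String)

  final class Writer(attributeSeq: Seq[Attribute]) {
    // Converted once, up front; cheap relative to writing millions of rows.
    private val attributes: Array[Attribute] = attributeSeq.toArray

    def writeRow(record: IndexedSeq[Any]): Unit = {
      var index = 0
      while (index < attributes.length) {   // O(1) length on an Array
        if (record(index) != null) {
          val attr = attributes(index)      // O(1) lookup instead of an O(n) list walk
          // in the real RowWriteSupport this is where startField/writeValue/endField happen
          println(s"write ${attr.name} = ${record(index)}")
        }
        index += 1
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val attrs = List(Attribute("id"), Attribute("name"), Attribute("score"))
    new Writer(attrs).writeRow(Vector(1, "a", null))
  }
}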

@liancheng
Contributor

@MickDavies Thanks for pinning down the hotspot, looking forward to your PR :)

@jimfcarroll
Author

Thanks @MickDavies. I was JUST about to start profiling it again. This is the same Scala class I originally had issues with.

@jimfcarroll deleted the parquet-perf branch December 23, 2014 16:07
@jimfcarroll
Author

Just an FYI. I changed line 141 of ParquetTableSupport.scala from this:

attributes = ParquetTypesConverter.convertFromString(origAttributesStr)

to this:

attributes = ParquetTypesConverter.convertFromString(origAttributesStr).toArray[Attribute]

(I also changed the type of attributes to Array[Attribute].) As @MickDavies said, this seems to have a fairly dramatic effect on performance.

@MickDavies
Contributor

@jimfcarroll - that's exactly the change I made. The performance improvements are very substantial for wide tables; as I said, in the case I was looking at it was 6x as fast, and more significant still if you consider just the processing in Spark. Thanks for making the improvement.

@jimfcarroll
Author

@MickDavies, I'm not a Spark committer, so I think they're still looking for a PR. I just wanted to let everyone know your improvement is substantial.

@marmbrus
Contributor

Yeah, would love to see a PR for this :)

@MickDavies
Contributor

@jimfcarroll sorry, I misunderstood your comment. Good that you have verified the performance gain.

I have added a PR: #3843.

@jimfcarroll
Author

@MickDavies thanks. I needed the change and was beginning the process of profiling again. A 5.5 million row, 2000+ column dataset took over 15 hours to write to a Parquet file for me, so I incorporated your change when I saw your description.
