[SPARK-29434][Core] Improve the MapStatuses Serialization Performance #26085
Conversation
Thank you for pinging me, @dbtsai. I'll take a look tomorrow.
advancedxy left a comment:
Regarding the current approach: what if the chunks are big enough that compressing directly is faster than the two-step approach?
Should we add an adaptive approach to choose the optimal one?
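A rough sketch of what such an adaptive choice could look like (purely hypothetical; the object name, threshold name, and threshold value below are invented for illustration and are not part of this PR):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream, OutputStream}

import com.github.luben.zstd.ZstdOutputStream

object AdaptiveMapStatusSerializer {
  // Invented threshold purely for illustration; a real adaptive scheme would need
  // benchmarking to find the crossover point (if one exists at all).
  private val TwoStepThresholdBytes = 64 * 1024

  def serialize(statuses: AnyRef, out: OutputStream, estimatedSize: Int): Unit = {
    if (estimatedSize <= TwoStepThresholdBytes) {
      // Two-step: serialize into an uncompressed buffer, then compress it in one go.
      val buffer = new ByteArrayOutputStream(4096)
      val objOut = new ObjectOutputStream(buffer)
      objOut.writeObject(statuses)
      objOut.close()
      val zos = new ZstdOutputStream(out)
      buffer.writeTo(zos)
      zos.close()
    } else {
      // One-step: stream the objects straight through the compressor.
      val objOut = new ObjectOutputStream(new ZstdOutputStream(out))
      objOut.writeObject(statuses)
      objOut.close()
    }
  }
}
```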
out.write(DIRECT)
val objOut = new ObjectOutputStream(new GZIPOutputStream(out))
import scala.language.reflectiveCalls
val out = new ByteArrayOutputStream(4096) {
Just curious, is there any specific reason to choose 4096 here?
I just chose something bigger. Any suggestion?
With Apache's ByteArrayOutputStream implementation, I don't need to put a larger initial value anymore, since that implementation doesn't have to allocate a new memory buffer and then copy the contents over.
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

import com.github.luben.zstd.ZstdInputStream
import com.github.luben.zstd.ZstdOutputStream
I recall we put third-party imports under the java and scala ones.
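For reference, the grouping being referred to looks roughly like this (the group comments and the specific Spark import are illustrative):

```scala
// java and scala imports first
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

import scala.collection.mutable

// third-party imports next
import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

// org.apache.spark imports last
import org.apache.spark.internal.Logging
```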
I haven't looked at the code yet; can you clarify what ops/ms is measuring here?

Just to verify, does this hold true for small jobs, where you have 10 or 100 outputs? Did you try with, say, 2000 or 5000 blocks? The zstd ops/ms for the two-step approach dropped a lot between 5 and 500 blocks, whereas the one-step approach didn't drop nearly as much; I'm wondering if one-step becomes faster at some point.
|
@tgravescs it's record / ms. When the num of blocks are large, two steps and one step will have similar result, but two step will never be slower. I switched to use In Apache's /* In contrast
* to the original it doesn't reallocate the whole memory block but allocates
* additional buffers. This way no buffers need to be garbage collected and
* the contents don't have to be copied to the new buffer.
*/ |
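A small sketch of the difference being discussed, assuming the Apache implementation meant here is Commons IO's `org.apache.commons.io.output.ByteArrayOutputStream`:

```scala
import java.io.{ByteArrayOutputStream => JavaByteArrayOutputStream}

import org.apache.commons.io.output.{ByteArrayOutputStream => ApacheByteArrayOutputStream}

object BufferGrowthSketch {
  // java.io.ByteArrayOutputStream grows by allocating a bigger array and copying the
  // old contents over, so choosing a good initial size matters for large payloads.
  val javaOut = new JavaByteArrayOutputStream(4096)

  // The Commons IO version grows by chaining additional buffers instead: no
  // reallocation, no copy, and no discarded buffers left for the garbage collector.
  val apacheOut = new ApacheByteArrayOutputStream(4096)

  // Both expose writeTo(OutputStream), so the accumulated bytes can be streamed into
  // a compressor (e.g. ZstdOutputStream) without materializing an extra array copy.
}
```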
Hi, @dbtsai. I created a PR to your branch. Could you review and merge the updated benchmark result?
@dongjoon-hyun thanks. Merged into my branch.
dongjoon-hyun left a comment:
+1, LGTM. It's a nice improvement, @dbtsai. Thanks!
Thank you, @dbtsai, @tgravescs, @viirya, @MaxGekk, @advancedxy.
Sorry for my late comment here; I didn't have time last week to follow up. Just curious, did you try lz4 (or another codec) instead of zstd? I would expect zstd to be the better choice, but I'm wondering whether you tried it, what the performance difference was, and whether it makes sense to make this configurable in the future.
That sounds like a reasonable idea. Could you make a JIRA and a PR for the configuration, @tgravescs? I can help you with the benchmark. Thanks!
So does that mean you know lz4 (or others) weren't tried? I was curious whether it was tried already before spending time on it.
No, I don't think it was tried. To make sure, let's ping @dbtsai. :)
@tgravescs let me try lz4 quickly, and I will post the result. Thanks.
@tgravescs The following is the result run on my desktop. LZ4 is 5x faster but creates 1.6x bigger data. Should we trade the serialization time for larger data?
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Serialization 3340 3355 21 0.1 16700.1 1.0X
Deserialization 650 660 14 0.3 3248.6 5.1X
Compressed Serialized MapStatus sizes: 123 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
Running benchmark: 200000 MapOutputs, 1000 blocks w/o broadcast
Running case: Serialization
Stopped after 3 iterations, 2109 ms
Running case: Deserialization
Stopped after 5 iterations, 2424 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Serialization 677 703 32 0.3 3383.6 1.0X
Deserialization 466 485 27 0.4 2331.1 1.5X
Compressed Serialized MapStatus sizes: 194 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Serialization 2199 2202 4 0.1 10994.6 1.0X
Deserialization 690 720 46 0.3 3450.6 3.2X
Compressed Serialized MapStatus sizes: 182 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
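A sketch of how the codec might be swapped for such a comparison (assuming lz4-java's `LZ4BlockOutputStream`; the benchmark's serialization code is unchanged and elided here):

```scala
import java.io.OutputStream

import com.github.luben.zstd.ZstdOutputStream
import net.jpountz.lz4.LZ4BlockOutputStream

object CodecSwapSketch {
  // Only the stream wrapped around the output buffer changes between the runs above;
  // the Java serialization of the MapStatuses stays identical.
  def wrapCodec(out: OutputStream, codec: String): OutputStream = codec match {
    case "zstd" => new ZstdOutputStream(out)      // smaller output, slower in this test
    case "lz4"  => new LZ4BlockOutputStream(out)  // ~5x faster here, ~1.6x larger output
    case other  => throw new IllegalArgumentException(s"Unknown codec: $other")
  }
}
```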
In this PR, ZSTD reduces the size like the following. If we choose
Thanks for running the tests, @dbtsai. It's actually faster than I expected. Yes, it needs to distribute it, and this is obviously using memory on the driver side. Normally, if the map status is of any size, it ends up being broadcast to the hosts, with the message going over the wire just indicating that it's a broadcast. To me this isn't much different from any other broadcast, which normally has the spark.io.compression.codec config applied for what compression to use. You may actually want it faster if you have ample network. I assume that originally, before we were broadcasting it, the size was definitely an issue, because if it went over the max message size it would just fail. It also took a long time for large statuses and slower networks. That was also before we had highly compressed statuses and such. This is definitely an improvement over what we had. Perhaps we just wait and see if it's an issue or if someone wants to use something other than zstd, and at that point we can make it configurable if needed. I hate to add more configs if not really needed.
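For reference, the `spark.io.compression.codec` setting mentioned above is an ordinary Spark conf; a minimal example (the codec values noted are the standard ones Spark supports):

```scala
import org.apache.spark.SparkConf

object CodecConfExample {
  // spark.io.compression.codec controls compression for internal data such as broadcast
  // variables and shuffle spills. lz4 is the default; lzf, snappy, and zstd are alternatives.
  val conf: SparkConf = new SparkConf()
    .setAppName("codec-example")
    .set("spark.io.compression.codec", "zstd")
}
```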
Got it. Thanks, @tgravescs.
Agreed, this PR already speeds up the serialization a bit, and it unblocks our use case. I was initially thinking of adding another configuration, but I also feel we have too many tuning parameters in Spark now. Let's see what other people think, and if they need to make it faster, we can consider the option of making it configurable. Thanks.
}
val outArr = {
  compressedOut.reset()
  val zos = new ZstdOutputStream(compressedOut)
Hi @dbtsai, I am back-porting this into our internal repo. It looks like this compression is unnecessary, since arr is already compressed by zstd. Compressing an already-compressed byte[] again is a waste of CPU time. WDYT?
The actual value of the data (which is already compressed) will not be in the serialized form of out.writeTo(zos), as it's transient. Here, we are just serializing the reference to the actual data, and the actual data will be broadcast through TorrentBroadcast. See the next log, "Broadcast mapstatuses size = " + outArr.length + ", actual size = " + arr.length, for your real data; the broadcast one is very small.
Thanks for the clarification. It indeed does not include the compressed data.
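A minimal, self-contained sketch of the mechanism described above, with illustrative class and field names: a `@transient` field is skipped by Java serialization, so only the small wrapper goes over the wire while the big payload is distributed separately (via TorrentBroadcast in Spark's case):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative wrapper: `payload` is transient, so serializing an instance of this
// class does NOT write the (potentially huge, already-compressed) bytes.
class BroadcastedStatuses(@transient val payload: Array[Byte],
                          val broadcastId: Long) extends Serializable

object TransientDemo {
  def main(args: Array[String]): Unit = {
    val big = Array.fill[Byte](10 * 1024 * 1024)(1) // 10 MB "compressed" payload
    val wrapper = new BroadcastedStatuses(big, broadcastId = 42L)

    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(wrapper)
    oos.close()

    // Only the reference/metadata is serialized; the transient payload is skipped.
    println(s"serialized wrapper size = ${bos.size()} bytes, payload size = ${big.length} bytes")
  }
}
```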
What changes were proposed in this pull request?

Instead of using GZIP for compressing the serialized `MapStatuses`, ZStd provides a better compression rate and faster compression time. The original approach serializes and writes data directly into a `GZIPOutputStream` in one step; however, the compression time is faster if a bigger chunk of the data is processed by the codec at once. As a result, in this PR, the serialized data is written into an uncompressed byte array first, and then the data is compressed. For smaller `MapStatuses`, we find it's 2x faster.

Here is the benchmark result.

20k map outputs, and each has 500 blocks
1. ZStd two steps in this PR: 0.402 ops/ms, 89,066 bytes
2. ZStd one step as the original approach: 0.370 ops/ms, 89,069 bytes
3. GZip: 0.092 ops/ms, 217,345 bytes

20k map outputs, and each has 5 blocks
1. ZStd two steps in this PR: 0.9 ops/ms, 75,449 bytes
2. ZStd one step as the original approach: 0.38 ops/ms, 75,452 bytes
3. GZip: 0.21 ops/ms, 160,094 bytes

Why are the changes needed?

Decrease the time for serializing the `MapStatuses` in large-scale jobs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.
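A minimal sketch of the one-step vs. two-step contrast described above (helper names are illustrative; the actual change lives in `MapOutputTracker`, and the Apache Commons IO `ByteArrayOutputStream` is assumed as the buffer):

```scala
import java.io.ObjectOutputStream
import java.util.zip.GZIPOutputStream

import org.apache.commons.io.output.{ByteArrayOutputStream => ApacheByteArrayOutputStream}
import com.github.luben.zstd.ZstdOutputStream

object MapStatusSerializationSketch {
  // One-step (original approach): serialize straight through the compressor, so the
  // codec sees many small writes.
  def serializeOneStepGzip(statuses: AnyRef): Array[Byte] = {
    val out = new ApacheByteArrayOutputStream(4096)
    val objOut = new ObjectOutputStream(new GZIPOutputStream(out))
    objOut.writeObject(statuses)
    objOut.close()
    out.toByteArray
  }

  // Two-step (this PR): serialize into an uncompressed buffer first, then compress the
  // whole buffer at once, letting the codec work on one big chunk.
  def serializeTwoStepZstd(statuses: AnyRef): Array[Byte] = {
    val uncompressed = new ApacheByteArrayOutputStream(4096)
    val objOut = new ObjectOutputStream(uncompressed)
    objOut.writeObject(statuses)
    objOut.close()

    val compressed = new ApacheByteArrayOutputStream(4096)
    val zos = new ZstdOutputStream(compressed)
    uncompressed.writeTo(zos)
    zos.close()
    compressed.toByteArray
  }
}
```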