Conversation

@dbtsai (Member) commented Oct 10, 2019

What changes were proposed in this pull request?

Instead of using GZIP to compress the serialized MapStatuses, this PR switches to ZStd, which provides a better compression ratio and faster compression.

The original approach serializes and writes the data directly into a GZIPOutputStream in one step; however, compression is faster when a bigger chunk of data is handed to the codec at once. As a result, in this PR the serialized data is written into an uncompressed byte array first, and then the data is compressed. For smaller MapStatuses, we find this is 2x faster.
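As a rough illustration of the two approaches (a minimal sketch using java.io serialization and zstd-jni, not the exact code in this PR; the method and variable names are placeholders):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import com.github.luben.zstd.ZstdOutputStream

// One step: the ObjectOutputStream feeds the codec directly, so zstd sees
// many small writes. `payload` must be Java-serializable.
def oneStep(payload: AnyRef): Array[Byte] = {
  val out = new ByteArrayOutputStream(4096)
  val objOut = new ObjectOutputStream(new ZstdOutputStream(out))
  objOut.writeObject(payload)
  objOut.close()
  out.toByteArray
}

// Two steps (this PR's idea): serialize into an uncompressed buffer first,
// then hand the whole buffer to zstd as one big chunk.
def twoSteps(payload: AnyRef): Array[Byte] = {
  val uncompressed = new ByteArrayOutputStream(4096)
  val objOut = new ObjectOutputStream(uncompressed)
  objOut.writeObject(payload)
  objOut.close()

  val compressed = new ByteArrayOutputStream(4096)
  val zos = new ZstdOutputStream(compressed)
  uncompressed.writeTo(zos)
  zos.close()
  compressed.toByteArray
}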

Here is the benchmark result.

20k map outputs, and each has 500 blocks

  1. ZStd two steps in this PR: 0.402 ops/ms, 89,066 bytes
  2. ZStd one step as the original approach: 0.370 ops/ms, 89,069 bytes
  3. GZip: 0.092 ops/ms, 217,345 bytes

20k map outputs, and each has 5 blocks

  1. ZStd two steps in this PR: 0.9 ops/ms, 75,449 bytes
  2. ZStd one step as the original approach: 0.38 ops/ms, 75,452 bytes
  3. GZip: 0.21 ops/ms, 160,094 bytes

Why are the changes needed?

It decreases the time spent serializing the MapStatuses in large-scale jobs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@dbtsai (Member, Author) commented Oct 10, 2019

@dongjoon-hyun (Member):

Thank you for pinging me, @dbtsai. I'll take a look tomorrow.

@advancedxy (Contributor) left a comment:

Regarding the current approach: what if the chunks are big enough that compressing directly is faster than the two-step approach? Should we add an adaptive approach to choose the optimal one?
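One hedged sketch of what such an adaptive variant could look like (purely illustrative, not something this PR implements; the threshold and names are made up):

import java.io.{ByteArrayOutputStream, OutputStream}
import com.github.luben.zstd.ZstdOutputStream

// Hypothetical: buffer-then-compress for small payloads, stream directly
// into the codec for large ones. The 4 MB threshold is arbitrary.
val adaptiveThreshold: Long = 4L * 1024 * 1024

def compressAdaptively(writePayload: OutputStream => Unit,
                       estimatedSize: Long,
                       sink: OutputStream): Unit = {
  val zos = new ZstdOutputStream(sink)
  if (estimatedSize < adaptiveThreshold) {
    // Two steps: serialize into a buffer, then feed zstd one big chunk.
    val buf = new ByteArrayOutputStream(4096)
    writePayload(buf)
    buf.writeTo(zos)
  } else {
    // One step: stream straight into the codec to avoid holding a large
    // uncompressed copy in memory.
    writePayload(zos)
  }
  zos.close()
}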

out.write(DIRECT)
val objOut = new ObjectOutputStream(new GZIPOutputStream(out))
import scala.language.reflectiveCalls
val out = new ByteArrayOutputStream(4096) {
A Contributor commented:

Just curious, is there any specific reason to choose 4096 here?

@dbtsai (Member, Author) replied:

I just chose something bigger. Any suggestions?

@dbtsai (Member, Author) replied:

With Apache's ByteArrayOutputStream implementation, I don't need to pass a larger initial value anymore, and that implementation doesn't have to allocate a new memory buffer and copy the contents over.

import java.util.zip.{GZIPInputStream, GZIPOutputStream}

import com.github.luben.zstd.ZstdInputStream
import com.github.luben.zstd.ZstdOutputStream
A Member commented:

I recall we put third-party imports below the java and scala import groups.
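For reference, the grouping per that convention would look roughly like this (the scala import line is only illustrative; the point is that com.github.luben belongs in the third-party group, after java and scala):

import java.util.zip.{GZIPInputStream, GZIPOutputStream}

import scala.collection.mutable

import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}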

@tgravescs (Contributor):

I haven't looked at the code yet; can you clarify what ops/ms is measuring here?

Regarding "For smaller MapStatuses, we find it's 2x faster" — just to verify, does this hold true for small jobs, where you have 10 or 100 outputs?

Did you try with, say, 2000 or 5000 blocks? The zstd ops/ms for the two-step approach dropped a lot between 5 and 500 blocks, whereas the one-step approach didn't drop nearly as much; I'm wondering if one step becomes faster at some point.

@dbtsai (Member, Author) commented Oct 11, 2019

@tgravescs it's records/ms. When the number of blocks is large, the two-step and one-step approaches have similar results, but two steps will never be slower.

I switched to org.apache.commons.io.output.ByteArrayOutputStream instead of the standard ByteArrayOutputStream and saw another 25% performance gain.

The Javadoc of Apache's ByteArrayOutputStream says:

/* In contrast
 * to the original it doesn't reallocate the whole memory block but allocates
 * additional buffers. This way no buffers need to be garbage collected and
 * the contents don't have to be copied to the new buffer.
 */
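A minimal sketch contrasting the two classes (assuming commons-io is on the classpath; the payload is made-up data):

import org.apache.commons.io.output.{ByteArrayOutputStream => CommonsBAOS}

val payload = Array.fill[Byte](1 << 20)(1.toByte)  // 1 MB of dummy data

// java.io.ByteArrayOutputStream grows by allocating a bigger array and
// copying the old contents into it on every resize.
val jdkOut = new java.io.ByteArrayOutputStream(4096)
jdkOut.write(payload)

// commons-io keeps a list of internal buffers and simply appends a new one
// when it grows, so no reallocation or copying happens until toByteArray.
val commonsOut = new CommonsBAOS()
commonsOut.write(payload)
val arr: Array[Byte] = commonsOut.toByteArray  // the single copy happens here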


SparkQA commented Oct 12, 2019

Test build #111936 has finished for PR 26085 at commit a601356.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 16, 2019

Test build #112193 has finished for PR 26085 at commit 8dc8fad.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 18, 2019

Test build #112239 has finished for PR 26085 at commit 0bf182a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun changed the title from "[SPARK-29434] [Core] Improve the MapStatuses Serialization Performance" to "[SPARK-29434][Core] Improve the MapStatuses Serialization Performance" on Oct 18, 2019.
@dongjoon-hyun (Member):

Hi, @dbtsai. I created a PR against your branch. Could you review and merge the updated benchmark result?

@dbtsai (Member, Author) commented Oct 18, 2019

@dongjoon-hyun thanks. Merged into my branch.

@dongjoon-hyun (Member) left a comment:

+1, LGTM. It's a nice improvement, @dbtsai. Thanks!


SparkQA commented Oct 18, 2019

Test build #112299 has finished for PR 26085 at commit f184c4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 19, 2019

Test build #112300 has finished for PR 26085 at commit bc6a14c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 19, 2019

Test build #112301 has finished for PR 26085 at commit 5aadd8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 19, 2019

Test build #112303 has finished for PR 26085 at commit 08c8fb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Thank you, @dbtsai, @tgravescs, @viirya, @MaxGekk, @advancedxy.
Merged to master.

@tgravescs (Contributor):

Sorry for my late comment here; I didn't have time last week to follow up. Just curious, did you try LZ4 (or another codec) instead of zstd? I would expect zstd to be the better choice, but I'm wondering if you tried it, what the performance difference was, and whether it would make sense to make this configurable in the future.

@dongjoon-hyun (Member):

That sounds like a reasonable idea. Could you create a JIRA and a PR for the configuration, @tgravescs? I can help you with the benchmark. Thanks!

@tgravescs (Contributor):

So does that mean you know LZ4 (or others) weren't tried? I was curious whether it had already been tried before spending time on it.

@dongjoon-hyun (Member):

No, I don't think it has been tried. To make sure, let's ping @dbtsai. :)

@dbtsai (Member, Author) commented Oct 21, 2019

@tgravescs let me try LZ4 quickly, and I will post the result. Thanks.

@dbtsai (Member, Author) commented Oct 21, 2019

@tgravescs The following results were run on my desktop. LZ4 is about 5x faster but produces about 1.6x larger output. Should we trade serialization time for larger data?

  1. ZSTD
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Serialization                                      3340           3355          21          0.1       16700.1       1.0X
Deserialization                                     650            660          14          0.3        3248.6       5.1X

Compressed Serialized MapStatus sizes: 123 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
  2. LZ4
Running benchmark: 200000 MapOutputs, 1000 blocks w/o broadcast
  Running case: Serialization
  Stopped after 3 iterations, 2109 ms
  Running case: Deserialization
  Stopped after 5 iterations, 2424 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Serialization                                       677            703          32          0.3        3383.6       1.0X
Deserialization                                     466            485          27          0.4        2331.1       1.5X

Compressed Serialized MapStatus sizes: 194 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes
  3. LZF
Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.2
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
200000 MapOutputs, 1000 blocks w/o broadcast:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Serialization                                      2199           2202           4          0.1       10994.6       1.0X
Deserialization                                     690            720          46          0.3        3450.6       3.2X

Compressed Serialized MapStatus sizes: 182 MB
Compressed Serialized Broadcast MapStatus sizes: 0 bytes

@dongjoon-hyun (Member) commented Oct 22, 2019

In this PR, ZSTD reduces the size as shown below. If we choose LZ4, it seems to be a size regression compared with previous Apache Spark versions that used GZIP. We need to distribute this file to all nodes, don't we? I guess that was the main reason Apache Spark preferred GZIP before.

- Compressed Serialized MapStatus sizes: 131 MB
+ Compressed Serialized MapStatus sizes: 123 MB

@tgravescs (Contributor):

Thanks for running the tests, @dbtsai. It's actually faster than I expected.

Yes, it needs to be distributed, and this is obviously using memory on the driver side. Normally, if the map status is of any size, it ends up being broadcast to the hosts, with the message going over the wire just indicating that it's a broadcast. To me this isn't much different from any other broadcast, which normally has the spark.io.compression.codec config applied to choose the compression. You may actually want it faster if you have ample network bandwidth. I assume that originally, before we were broadcasting it, the size was definitely an issue, because if it went over the max message size it would simply fail. It also took a long time for large statuses and slower networks. That was also before we had highly compressed statuses and such.

This is definitely an improvement over what we had. Perhaps we should just wait and see if it becomes an issue, or if someone wants to use something other than zstd, and at that point we can make it configurable if needed. I hate to add more configs if they're not really needed.
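If this ever does become configurable, one hedged sketch of what codec selection could look like (purely illustrative, not part of this PR; the helper name is made up, and the lz4 stream class is assumed to be the net.jpountz.lz4 one Spark already depends on):

import java.io.OutputStream
import com.github.luben.zstd.ZstdOutputStream
import net.jpountz.lz4.LZ4BlockOutputStream

// Hypothetical: pick the MapStatus compression stream from a config value.
def wrapForCodec(out: OutputStream, codec: String): OutputStream = codec match {
  case "zstd" => new ZstdOutputStream(out)
  case "lz4"  => new LZ4BlockOutputStream(out)
  case other  => throw new IllegalArgumentException(s"Unsupported codec: $other")
}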

@dongjoon-hyun (Member):

Got it. Thanks, @tgravescs.

@dbtsai (Member, Author) commented Oct 22, 2019

Agreed. This PR already speeds up the serialization a bit and unblocks our use case. I was initially thinking of adding another configuration, but I also feel we have too many tuning parameters in Spark now. Let's see what other people think; if they need it to be faster, we can consider making it configurable.

Thanks.

}
val outArr = {
compressedOut.reset()
val zos = new ZstdOutputStream(compressedOut)
A Contributor commented:

Hi, @dbtsai, I am back-porting this into our internal repo. It looks like this compression is unnecessary since arr is already compressed by zstd; compressing an already compressed byte[] again is a waste of CPU time. WDYT?

@dbtsai (Member, Author) replied:

The actual value of the data (which is already compressed) will not be in the serialized form of out.writeTo(zos), as it's transient. Here we are just serializing the reference to the actual data, and the actual data will be broadcast through TorrentBroadcast. See the next log, "Broadcast mapstatuses size = " + outArr.length + ", actual size = " + arr.length, for your real data; the broadcast one is very small.
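A toy illustration of why that second pass stays small (the class and field names below are made up; this is not the actual Spark class): Java serialization skips @transient fields, so only the small reference metadata is written, while the big compressed array travels separately via the broadcast.

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for the broadcast wrapper: the large compressed
// payload is @transient, so writeObject serializes only the small id.
class BroadcastRef(val id: Long, @transient val payload: Array[Byte]) extends Serializable

val ref = new BroadcastRef(42L, Array.fill[Byte](1 << 20)(0.toByte))  // ~1 MB payload
val out = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(out)
oos.writeObject(ref)
oos.close()
println(out.size())  // a few hundred bytes at most: the payload was skipped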

The Contributor replied:

Thanks for your clarification. It indeed does not include the compressed data.

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
Instead of using GZIP to compress the serialized `MapStatuses`, ZStd provides a better compression ratio and faster compression.

The original approach serializes and writes the data directly into a `GZIPOutputStream` in one step; however, compression is faster when a bigger chunk of data is processed by the codec at once. As a result, in this PR, the serialized data is written into an uncompressed byte array first, and then the data is compressed. For smaller `MapStatuses`, we find it's 2x faster.

Here is the benchmark result.

20k map outputs, and each has 500 blocks

1. ZStd two steps in this PR: 0.402 ops/ms, 89,066 bytes
2. ZStd one step as the original approach: 0.370 ops/ms, 89,069 bytes
3. GZip: 0.092 ops/ms, 217,345 bytes

20k map outputs, and each has 5 blocks

1. ZStd two steps in this PR: 0.9 ops/ms, 75,449 bytes
2. ZStd one step as the original approach: 0.38 ops/ms, 75,452 bytes
3. GZip: 0.21 ops/ms, 160,094 bytes

Decrease the time for serializing the `MapStatuses` in large-scale jobs.

No.

Existing tests.

Closes apache#26085 from dbtsai/mapStatus.

Lead-authored-by: DB Tsai <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Ref: LIHADOOP-56788