
Conversation


@mccheah mccheah commented Apr 15, 2019

This solves the consistency and resource leakage concerns with the first iteration of this API, where it would not be obvious that the streamable resources created by ShufflePartitionWriter needed to be closed by ShufflePartitionWriter#close as opposed to closing the resources directly.

This introduces the following adjustments:

  • Channel-based writes are separated out to their own module, SupportsTransferTo. This allows the transfer-to APIs to be modified independently, and users that only provide output streams can ignore the NIO APIs entirely. This also allows us to mark the base ShufflePartitionWriter as a stable API eventually while keeping the NIO APIs marked as experimental or developer-api.

  • We add APIs that explicitly encode the notion of transferring bytes from one source to another. The partition writer returns an instance of TransferrableWritableByteChannel, which has APIs for accepting a TransferrableReadableByteChannel and can tell the readable byte channel to transfer its bytes out to some destination sink.

  • The resources returned by ShufflePartitionWriter are always closed. Internally, DefaultMapOutputWriter keeps resources open until commitAllPartitions() is called.

  • ShufflePartitionWriter no longer extends Closeable, clarifying the fact that the resources returned by the partition writer will be closed by the callers.

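Schematically, the pieces described above fit together roughly like this. This is a hedged sketch: the interface names come from the description, but the exact signatures are assumptions rather than the PR's code, and each interface would live in its own file.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;

// Base writer: callers close the streams it returns. Note that it
// deliberately does NOT extend Closeable.
interface ShufflePartitionWriter {
  OutputStream openStream() throws IOException;
  long getCount() throws IOException;
}

// A readable source that can push its bytes to a destination sink.
// (Its methods are elided here; see the discussion below.)
interface TransferrableReadableByteChannel extends Closeable {
}

// A writable sink that accepts bytes from a readable source.
interface TransferrableWritableByteChannel extends Closeable {
  void transferFrom(TransferrableReadableByteChannel source, long numBytesToTransfer)
      throws IOException;
}

// Optional NIO mix-in, kept separate so the transfer-to APIs can evolve
// independently of the stable base API.
interface SupportsTransferTo extends ShufflePartitionWriter {
  TransferrableWritableByteChannel openTransferrableChannel() throws IOException;
}
```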
@mccheah mccheah changed the title Propose a new NIO transfer API for partition writing. [SPARK-25299] Propose a new NIO transfer API for partition writing. Apr 15, 2019

@ifilonenko ifilonenko left a comment


A few questions

partitionInputStream = new LimitedInputStream(spillInputStreams[i],
    partitionLengthInSpill, false);
partitionInputStream = blockManager.serializerManager().wrapForEncryption(
partitionOutput = writer.openStream();


I see that you are not wrapping this around a CloseShieldOutputStream. How are you now shielding the close of the compressor?

Author

You aren't, that's the point. The shielding of the close of the underlying file is handled by DefaultShuffleMapOutputWriter.
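For illustration, a hedged sketch of how that shielding could look inside the default map output writer; the class name and details here are assumptions, not the PR's exact code:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical close-shielding stream: the map output writer hands one of these
// to each partition's caller. Callers may close it freely; only
// commitAllPartitions() closes the shared underlying file stream.
final class ShieldedPartitionStream extends FilterOutputStream {
  private boolean closed = false;

  ShieldedPartitionStream(OutputStream sharedFileStream) {
    super(sharedFileStream);
  }

  @Override
  public void write(int b) throws IOException {
    if (closed) {
      throw new IOException("Partition stream is already closed");
    }
    out.write(b);
  }

  @Override
  public void close() throws IOException {
    // Flush this partition's bytes, but deliberately leave the shared file
    // stream open; the map output writer owns its lifecycle.
    out.flush();
    closed = true;
  }
}
```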

}

public long getCount() throws IOException {
long writtenPosition = outputFileChannel.position();


Wrap this in a try {} in case the outputFileChannel can't calculate the position of the file channel?

Author

This throws IOException which should propagate up and cause the failure in the right place.

partitionWriter.close()
}
}
mapOutputWriter.getNextPartitionWriter


Why no close here? What if it's an empty partition?

Author

You aren't closing partition writers anymore, only any streams that they create.

// close objOut to flush the compressed bytes to the partition writer stream, but we don't want
// to close the partition output stream in the process.
partitionStream = new CloseShieldOutputStream(partitionWriter.toStream)
partitionStream = partitionWriter.openStream


I think it would be wise here to add a comment about the contract of the stream

Author

There isn't really a contract aside from the fact that we'll close it - which is expected because OutputStream extends Closeable.


@ifilonenko ifilonenko left a comment


I prefer this reorganization as it removes the need for the CloseShield


mccheah commented Apr 29, 2019

@vanzin @squito - was curious what you thought about this change.

For context, if we don't do it this way, we run into a weird inconsistency where we can't close the returned channel (since it has to be an instance of FileChannel) but we can, and probably should, close partition output streams.


@yifeih yifeih left a comment


I like the new API, but I left a few comments

import java.io.IOException;
import java.nio.channels.WritableByteChannel;

public class DefaultTransferrableWritableByteChannel implements TransferrableWritableByteChannel {


Move this out of the api package and into shuffle.sort along with the other implementation classes.

* Returns the number of bytes written either by this writer's output stream opened by
* {@link #openStream()} or the byte channel opened by {@link #openTransferrableChannel()}.
*/
@Override


Why is this getting overridden? Is it for the Javadoc?

Author

Yes, it's explicitly for the Javadoc - we have to specifically say that the count has to take the channel into account if it was opened.
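In other words, a pattern like the following schematic example (not the PR's code, and reusing the sketched ShufflePartitionWriter from the description above): the override adds no behavior, it exists only to refine the inherited documentation.

```java
import java.io.IOException;

// Re-declaring an inherited method solely to tighten its Javadoc.
abstract class ChannelCapablePartitionWriter implements ShufflePartitionWriter {

  /**
   * Returns the number of bytes written, whether they went through the stream
   * opened by {@link #openStream()} or the channel opened by
   * {@code openTransferrableChannel()}.
   */
  @Override
  public abstract long getCount() throws IOException;
}
```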

Closeables.close(in, copyThrewException);
}
ShufflePartitionWriter writer = mapOutputWriter.getPartitionWriter(i);
if (file.exists()) {


I think you're missing the case where file.exists() is false; you still need to set copyThrewException to false.

Author

The value of copyThrewException isn't used if the file does not exist.


/**
* Opens and returns a {@link TransferrableWritableByteChannel} for transferring bytes from
* partial input byte channels to the underlying shuffle data store.


what does "partial" mean in this situation?

Author

Partial meaning the spill file chunks. How can we clarify this without exposing information about the internals of sort-based shuffle?


Hmm, I think the use of "partial" is unclear. Also, none of the other Javadocs say anything about spill file merging, so I think it's better to omit the word "partial".

WritableByteChannel channel = writer.toChannel();
partitionChannel = writer instanceof SupportsTransferTo ?
    ((SupportsTransferTo) writer).openTransferrableChannel()
    : new DefaultTransferrableWritableByteChannel(


Is it weird that we create a DefaultTransferrableWritableByteChannel here if transferToEnabled is true and the writer is an instance of SupportsTransferTo, but in BypassMergeSortShuffleWriter we don't, and both transferToEnabled has to be true and the writer must be of type SupportsTransferTo in order for us to take that codepath?

Author

I don't mind the inconsistency too much.

Author

But it's easily fixed, so we can align these.

outputChannel = ((SupportsTransferTo) writer).openTransferrableChannel();
TransferrableReadableByteChannel inputTransferable =
    new FileTransferrableReadableByteChannel(inputChannel, 0L);
outputChannel.transferFrom(inputTransferable, inputChannel.size());


I don't understand the need for all the layers of indirection here -- both TransferrableReadableByteChannel and TransferrableWritableByteChannel seem unnecessary. Even with a non-default writer, inputChannel is always a FileChannel; that's not something that changes. The plugin just needs to provide a WritableByteChannel. Then you'd do something like:

WritableByteChannel outputChannel = ((SupportsTransferTo) writer).openWritableChannel();
Utils.copyFileStreamNIO(inputChannel, outputChannel, 0, inputChannel.size());

@mccheah mccheah May 1, 2019

The problem is the asymmetry of how to handle closing the channel vs. closing the output stream.

If we call outputChannel.close() here, that closes the underlying output file, but that's not what we want to do in the default writer implementation.

If we don't, we risk leaking channels in non-default implementations.

We should close the output stream returned when toStream is called, but we can't close the output stream and not close the channel - they should be symmetrical.

If we have the default writer return a channel that isn't a FileChannel, say, some implementation of a channel that shields from closing, then input.transferTo(output) won't do the right thing - it won't do the zero-copy, since implementations of FileChannel check that the output channel is specifically an instanceof FileChannel.

The whole premise of this refactor is to make it possible to close the output channel, but to have the implementations be able to decide whether or not that closes the output resource.
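To illustrate the zero-copy constraint described here: FileChannel#transferTo only takes its fast sendfile()-style path when it recognizes the concrete type of the target channel, so even a thin wrapper forfeits that optimization. The CloseShieldingChannel below is a hypothetical stand-in for such a wrapper, not a class from this PR.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

public final class TransferToDemo {

  // Hypothetical wrapper: delegates writes but ignores close().
  private static final class CloseShieldingChannel implements WritableByteChannel {
    private final WritableByteChannel delegate;

    CloseShieldingChannel(WritableByteChannel delegate) {
      this.delegate = delegate;
    }

    @Override
    public int write(ByteBuffer src) throws IOException {
      return delegate.write(src);
    }

    @Override
    public boolean isOpen() {
      return delegate.isOpen();
    }

    @Override
    public void close() {
      // Deliberately a no-op: closing is handled elsewhere.
    }
  }

  public static void main(String[] args) throws IOException {
    try (FileChannel in = new FileInputStream(args[0]).getChannel();
         FileChannel out = new FileOutputStream(args[1]).getChannel()) {

      // Fast path: the JDK recognizes the target as a FileChannel and can use
      // a kernel-level zero-copy transfer.
      in.transferTo(0, in.size(), out);

      // Still correct, but no longer zero-copy: the wrapper hides the
      // FileChannel type, so transferTo falls back to copying the bytes
      // through an intermediate buffer.
      WritableByteChannel shielded = new CloseShieldingChannel(out);
      in.transferTo(0, in.size(), shielded);
    }
  }
}
```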


Ah, ok, sorry, I had not really grasped that part before -- though I still might be missing something.

  1. So even given that, it still seems like my comments apply to TransferrableReadableByteChannel?
  2. Then shouldn't your implementation of DefaultTransferrableWritableByteChannel be a no-op? The whole point is you don't want to close that channel. And wouldn't you still want that close to be in an outer finally which closes the entire writer, since it's the thing which owns the outputChannel which spans everything? Seems like that would go for streams or channels, and most of the storage plugins we've discussed would want that.

I have a feeling I'm not getting something fundamental here ... I will go back and take a closer look at the code before this change.

Author

The idea is that we don't want to make the DefaultShufflePartitionWriter extend Closeable, because that decouples the lifecycle of closing the writer from the lifecycle of closing the channel or stream. It seems dubious to have the writer return a Closeable object that we do not close but we expect to be closed via other means.

For TransferrableReadableByteChannel, I created that to hide the fact that we're transferring from a file - it's more of an abstraction for removing the knowledge that bytes are transferred from files to the output.

Author

The main problem being that passing in a FileChannel directly can expose operations like writing to the file? Basically it seems ideal to create an abstraction that strictly limits what the plugin can do with it.


ok lemme step back a bit and make sure I understand the issue correctly.

I guess there are two styles of shuffle storage plugins we're concerned about here. Type (a) will store all the map output together (perhaps in a local file, like Spark's current shuffle storage, or maybe in some distributed storage, but still it's all written to one socket / channel / output stream). Type (b) will write each reduce-partition output, from this one map task, to a separate destination.

Type (a) is fine with the implementation you have before this change.

But type (b) had a problem, because the output streams don't get closed when we're done with them. They should get closed when the entire output of this mapper is done, as long as the user handles that properly in writer.close(). But in addition to being a cumbersome API, it also leaves those output streams / channels open far too long.

It's worth noting that type (b) is not like the old hash shuffle manager, which intentionally kept an output stream open for each reduce partition, because that just wrote records immediately as it received them. This is doing a local sort first, which should cut down the number of open output streams (which is where the hash shuffle manager would fall over, and why you really want to make sure you're closing them as soon as possible).

Before I attempt to propose another solution -- have I at least described the problem correctly?

Author

I think that's correct - we want to provide an API that allows for closing each partition stream individually, but the default implementation (and perhaps others) can internally shield the close of the returned streams.
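For illustration, a sketch of such a "type (b)" plugin, where closing each partition stream promptly actually releases a per-partition resource. RemoteShuffleClient and every other name here is hypothetical, and ShufflePartitionWriter is the sketched interface from the description above:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical remote storage client with one upload per reduce partition.
interface RemoteShuffleClient {
  OutputStream openPartitionUpload(int mapId, int reducePartitionId) throws IOException;
}

// A "type (b)" writer: each partition's bytes go to a separate destination, so
// the stream the caller closes maps one-to-one onto a remote connection that
// should be released as soon as that partition is done.
final class RemotePerPartitionWriter implements ShufflePartitionWriter {
  private final RemoteShuffleClient client;
  private final int mapId;
  private final int reducePartitionId;
  private long count = 0L;

  RemotePerPartitionWriter(RemoteShuffleClient client, int mapId, int reducePartitionId) {
    this.client = client;
    this.mapId = mapId;
    this.reducePartitionId = reducePartitionId;
  }

  @Override
  public OutputStream openStream() throws IOException {
    OutputStream upload = client.openPartitionUpload(mapId, reducePartitionId);
    // Count bytes as they pass through (byte-at-a-time for simplicity).
    // Closing this stream closes the upload, which is exactly what a
    // per-partition-destination plugin wants.
    return new FilterOutputStream(upload) {
      @Override
      public void write(int b) throws IOException {
        out.write(b);
        count++;
      }
    };
  }

  @Override
  public long getCount() {
    return count;
  }
}
```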


Ok, I think I finally get what you are doing. Makes sense; I would like more comments in the code explaining the "why" a little bit.

* partial input byte channels to the underlying shuffle data store.
*/
default TransferrableWritableByteChannel openTransferrableChannel() throws IOException {
return new DefaultTransferrableWritableByteChannel(Channels.newChannel(openStream()));


Why provide a default implementation here? Don't we actually want to discourage implementations from even declaring they provide this interface, unless they provide a smarter implementation?

Channels.newChannel(writer.openStream()));
}
TransferrableReadableByteChannel inputTransferable =
new FileTransferrableReadableByteChannel(inputChannel, 0L);


Separating out another thread for the need for this interface:

The main problem being that passing in a FileChannel directly can expose operations like writing to the file? Basically it seems ideal to create an abstraction that strictly limits what the plugin can do with it.

I guess I'm not too concerned about that. I mean, the plugin could also ignore the data in the file entirely, or shift all the bits left, or other random things. This is low-level enough that whoever is touching this needs to have some basic understanding of what they should do with it. IMO more indirection makes it harder to understand. In particular, you are trying to signal to a plugin author that they can leverage the transferTo which is only available on FileChannel.

if (writer instanceof SupportsTransferTo) {
  outputChannel = ((SupportsTransferTo) writer).openTransferrableChannel();
} else {
  outputChannel = new DefaultTransferrableWritableByteChannel(


Any point in having this else branch here? Shouldn't the if (transferToEnabled) above also check SupportsTransferTo? I don't think you'll get any performance benefit from a channel wrapping an output stream.

Author

This is to remain consistent with UnsafeShuffleWriter. See also #535 (comment)


mccheah commented May 7, 2019

Addressed comments either in follow-up discussion or in code in the latest patch, @squito.


mccheah commented May 24, 2019

@squito I think I addressed the comments; I'm going to merge, and we can address feedback in a follow-up.

@mccheah mccheah merged commit 62c2664 into spark-25299 May 24, 2019
@mccheah mccheah deleted the proposed-new-transferto-api branch May 24, 2019 19:15

@svc-spark-25299 svc-spark-25299 left a comment


================================================================================================
BlockStoreShuffleReader reader
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
no aggregation or sorting:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
local fetch                                       11089          11137          19          0.9        1108.9       1.0X
remote rpc fetch                                  11093          11159          43          0.9        1109.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
with aggregation:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
local fetch                                       29309          29564         229          0.1       14654.7       1.0X
remote rpc fetch                                  29184          29525         339          0.1       14592.1       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
with sorting:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
local fetch                                       30166          30529         230          0.1       15082.8       1.0X
remote rpc fetch                                  30361          30693         303          0.1       15180.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
with seek:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
seek to last record                                   1              1           1       2779.2           0.4       1.0X

================================================================================================
BypassMergeSortShuffleWriter write
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
BypassMergeSortShuffleWrite without spill:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without disk spill                      2              4           3          0.5        2018.8       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
BypassMergeSortShuffleWrite with spill:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
without transferTo                                 7215           7348          88          0.9        1075.2       1.0X
with transferTo                                    7325           7384          63          0.9        1091.5       1.0X

================================================================================================
SortShuffleWriter writer
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter without spills:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills                         10             16           4          0.1        9934.6       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
SortShuffleWriter with spills:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
no map side combine                               14324          14456          94          0.5        2134.4       1.0X
with map side aggregation                         14282          14414          93          0.5        2128.2       1.0X
with map side sort                                14289          14387         101          0.5        2129.2       1.0X

================================================================================================
UnsafeShuffleWriter write
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
UnsafeShuffleWriter without spills:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
small dataset without spills                         22             27          15          0.0       21697.4       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Linux 4.15.0-1014-gcp
Intel(R) Xeon(R) CPU @ 2.30GHz
UnsafeShuffleWriter with spills:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
without transferTo                                15908          16027         114          0.8        1185.3       1.0X
with transferTo                                   15885          16005         105          0.8        1183.6       1.0X



yifeih pushed a commit that referenced this pull request May 30, 2019
…535)

* Propose a new NIO transfer API for partition writing.

* Migrate unsafe shuffle writer to use new byte channel API.

* More sane implementation for unsafe

* Fix style

* Address comments

* Fix imports

* Fix build

* Fix more build problems

* Address comments.
mccheah added a commit that referenced this pull request Jun 27, 2019
…535)

* Propose a new NIO transfer API for partition writing.
