[SPARK-24235][SS] Implement continuous shuffle writer for single reader partition. #21428
Conversation
Test build #91131 has finished for PR 21428 at commit
Test build #91132 has finished for PR 21428 at commit
Test build #91135 has finished for PR 21428 at commit
tdas left a comment
Looking good, but needs a few improvements.
import org.apache.spark.sql.execution.streaming.continuous.{ContinuousExecution, EpochTracker}

/**
 * An RDD which continuously writes epochs from its child into a continuous shuffle.
nit: writes epoch data
var prev: RDD[UnsafeRow],
outputPartitioner: Partitioner,
endpoints: Seq[RpcEndpointRef])
extends RDD[Unit](prev) {
fix indent.
 * @param endpoints The [[UnsafeRowReceiver]] endpoints to write to. Indexed by partition ID within
 *                  outputPartitioner.
 */
class UnsafeRowWriter(
Looking at this PR and prev PRs, I think the names UnsafeRowWriter and UnsafeRowReader are not right. The basic interfaces ContinuousShuffleReader/Writer take UnsafeRows, hence that's not unique to this implementation (that is, all implementations of these interfaces will read/write UnsafeRows). What's unique is that this implementation uses the RPC mechanism to read/write. So it may be more accurate to name them RPCContinuousShuffleReader/Writer, or something like that.
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

/**
 * A [[ContinuousShuffleWriter]] sending data to [[UnsafeRowReceiver]] instances.
Another thought, not something that needs to be done now. But it might be overall cleaner if the ContinuousShuffleWriter and Reader are coupled together in a joint interface. This is because each writer implementation is always tied to a specific reader implementation, so they are always coupled together. Consider something like this.
trait ContinuousShuffleManager {
  def createWriter(writerId: Int, numReaders: Int): ContinuousShuffleWriter
  def createReader(readerId: Int, numWriters: Int): ContinuousShuffleReader
}
I am just guessing at the params on the createX interfaces; I might be missing something. But I feel that a small set of params should be sufficient for any implementation to figure out everything else. Also, other management/control-layer stuff would go into the manager implementation. For example, if the writers and readers need to exchange initial setup information (e.g. RPC endpoint details) through the driver, then the implementation of that would go into the manager.
Think about it as you're building out the rest of the architecture.
Agreed.
 * Trait for writing to a continuous processing shuffle.
 */
trait ContinuousShuffleWriter {
  def write(epoch: Iterator[UnsafeRow]): Unit
I don't think it's the right interface. The ContinuousShuffleWriter interface should be for writing the shuffled rows. The implementation should not be responsible for actually deciding partitions (i.e. outputPartitioner.getPartition(row)), as you don't want to re-implement the partitioning in every implementation. So I think the interface should be def write(row: UnsafeRow, partitionId: Int).
I think it's better encapsulation to re-implement the partitioning in every ContinuousShuffleWriter implementation than to re-implement it in every ContinuousShuffleWriter user. (Note that the non-continuous ShuffleWriter has precedent for this: it uses the same interface, and all implementations of ShuffleWriter do re-implement partitioning.)
I see. That's fair.
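For reference, a minimal sketch of the two interface shapes discussed above. The epoch-iterator form is what this PR uses; the row-plus-partition form (and the name given to it here) is hypothetical:

import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Shape used in this PR: the writer receives a whole epoch and applies the
// output partitioner itself (mirroring the non-continuous ShuffleWriter).
trait ContinuousShuffleWriter {
  def write(epoch: Iterator[UnsafeRow]): Unit
}

// Alternative shape raised above (hypothetical name): the caller picks the
// target partition and the writer only moves rows.
trait PartitionedContinuousShuffleWriter {
  def write(row: UnsafeRow, partitionId: Int): Unit
}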
  EpochTracker.incrementCurrentEpoch()
}

Iterator()
Seems like you don't really need an RDD here, you just need an action. You are consuming an iterator and returning nothing... that's exactly like rdd.foreachPartition. It may be that wrapping it in this RDD is cleaner in the bigger picture, but I am unable to judge without having the bigger picture in mind (bigger picture = how these Continuous*RDDs are going to be created by the SQL SparkPlan, and executed).
I honestly just did this to mirror ContinuousWriteRDD, which itself mirrored WriteToDataSourceV2Exec returning an empty RDD. We can take it out of the current PR - it's not being used anywhere yet, and I agree that where it ends up being used will determine the right interface.
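For context, a rough sketch of the compute shape under discussion (not the exact PR code; createWriter is a hypothetical helper): the write RDD drains its child's iterator into the shuffle writer once per epoch and returns an empty iterator, which is why a foreachPartition-style action would serve equally well.

override def compute(split: Partition, context: TaskContext): Iterator[Unit] = {
  val writer: ContinuousShuffleWriter = createWriter(split.index)  // hypothetical helper
  while (!context.isInterrupted() && !context.isCompleted()) {
    // Write one epoch's worth of rows from the child RDD, then advance the epoch.
    writer.write(prev.compute(split, context))
    EpochTracker.incrementCurrentEpoch()
  }
  Iterator()
}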
import org.apache.spark.sql.streaming.StreamTest
import org.apache.spark.sql.types.{DataType, IntegerType}

class ContinuousShuffleSuite extends StreamTest {
Discussed offline. Merged these tests into the earlier test suite. Name the combined one appropriately.
addressed comments

Test build #91174 has finished for PR 21428 at commit
TaskContext.unset()
ctx = null
super.afterEach()
test("one epoch") {
nit: I generally put the simplest tests first (likely the reader tests, since they don't depend on the writer) and the more complex, e2e-ish tests later (the writer tests, since they need readers).
Reordered.
eventually(timeout(streamingTimeout)) {
  assert(receiver.asInstanceOf[RPCContinuousShuffleReader].stopped.get())
}
}
There isn't a test where an RPCContinuousShuffleWriter writes to multiple reader endpoints.
Discussed offline and above - this is a deliberate limitation of the PR.
 * partition ID within outputPartitioner.
 */
class RPCContinuousShuffleWriter(
    writerId: Int,
nit: rename to partitionId?
I worry that partitionId is ambiguous with the partition to which the shuffle data is being written.
ok makes sense.
def write(epoch: Iterator[UnsafeRow]): Unit = {
  while (epoch.hasNext) {
    val row = epoch.next()
    endpoints(outputPartitioner.getPartition(row)).ask[Unit](ReceiverRow(writerId, row))
What about the case where the send fails? The result seems to be ignored here.
cc @zsxwing
It's my understanding that the RPC framework guarantees messages will be sent in the order that they're ask()ed, and that it's therefore not possible for a single row to fail to be sent while the ones before and after it succeed. If this is the case, then we don't need to handle it here - the query will just start failing to make progress.
If it's not the case, we'll need a more clever solution. Maybe have the epoch marker message contain a count for the number of rows that are supposed to be in the epoch?
A reliable channel (first case) seems like a requirement for correctness. In that case I think the query can just be restarted from the last successful epoch as soon as a failure is detected.
Discussed offline with @zsxwing. It's actually not valid to be sending these async at all - the framework will retry e.g. connection failures on the next row, so we can end up committing an epoch before we detect that a row within it has failed to send. We need to just make these synchronous.
This will incur a slight round-trip latency penalty for now, but as mentioned earlier the TCP-based shuffle is what we actually plan to be production quality. I'm hoping to begin work on it after I finish one more PR on top of this. So I think the latency should be fine for now.
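Concretely, the fix settled on here amounts to replacing the fire-and-forget ask with a blocking askSync, so a send failure surfaces before the epoch marker goes out. A sketch, following the write method shown in the diff above:

def write(epoch: Iterator[UnsafeRow]): Unit = {
  while (epoch.hasNext) {
    val row = epoch.next()
    // Blocking send: a connection failure propagates here instead of being
    // silently retried on a later row.
    endpoints(outputPartitioner.getPartition(row)).askSync[Unit](ReceiverRow(writerId, row))
  }
  // Tell every reader endpoint that this writer has finished the epoch.
  endpoints.foreach(_.askSync[Unit](ReceiverEpochMarker(writerId)))
}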
    outputPartitioner: Partitioner,
    endpoints: Array[RpcEndpointRef]) extends ContinuousShuffleWriter {

  if (outputPartitioner.numPartitions != 1) {
Any reason to disable it? This should work, right?
I believe so, but there's no way to test whether it will work until we implement the scheduling support for distributing the addresses of each of the multiple readers.
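A minimal sketch of the guard being discussed; the exception type and message are illustrative:

if (outputPartitioner.numPartitions != 1) {
  throw new IllegalArgumentException(
    "multiple reader partitions not yet supported by RPCContinuousShuffleWriter")
}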
Test build #91278 has finished for PR 21428 at commit
HeartSaVioR left a comment
Looks good. Only minor comments and nits.
ReceiverEpochMarker(0),
ReceiverRow(0, unsafeRow(111))
)
private implicit def unsafeRow(value: Int) = {
Just curious: is there a reason to rearrange the functions, this one and the two below? Looks like they're the same except for making this function implicit.
And where does it leverage the implicit attribute of this method? I'm not sure it is really needed, but I'm reviewing on the GitHub page so I might be missing something here.
writer.write(Iterator(1, 2, 3)) and such leverages the implicit.
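For illustration, a sketch of how the implicit is exercised by the tests; the projection body is approximate, and writer stands for the RPCContinuousShuffleWriter under test:

import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{DataType, IntegerType}

// Converts a test Int into a single-column UnsafeRow.
private implicit def unsafeRow(value: Int): UnsafeRow = {
  UnsafeProjection.create(Array(IntegerType: DataType))(
    new GenericInternalRow(Array(value: Any)))
}

// With the implicit in scope, Int arguments convert to UnsafeRows at the call site:
writer.write(Iterator(1, 2, 3))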
readRowThread.join()
}

test("multiple writer partitions") {
Would we want to have another test which covers out-of-order epochs between writers (if that's a valid case for us), or rely on the test in ContinuousShuffleReadRDD?
I think "reader epoch only ends when all writer partitions write it" is a sufficient test for that.
 * source tasks have sent one.
 */
private[shuffle] class UnsafeRowReceiver(
private[shuffle] class RPCContinuousShuffleReader(
nit: If we need the rename here, how about the other message names and comments?
https://github.com/apache/spark/pull/21428/files#diff-4072457048f805637bfce2c779608756R29
https://github.com/apache/spark/pull/21428/files#diff-4072457048f805637bfce2c779608756R35
Good point. Caught what I think are the rest.
Test build #91368 has finished for PR 21428 at commit
zsxwing left a comment
Overall looks good. Left some minor comments.
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case r: UnsafeRowReceiverMessage =>
  case r: RPCContinuousShuffleMessage =>
    queues(r.writerId).put(r)
This line may block RPC threads and cause some critical RPC messages to be delayed. In addition, if the reader fails, this line may block forever if the queue is full.
I'm okay with this right now since it's an experimental feature. Could you create a SPARK ticket and add a TODO here to note the potential issue so that we won't forget it?
I'm not sure what a critical RPC message is in this context. This line is intended to block forever if the queue is full; the receiver should not take any action or accept any other messages until the queue stops being full.
All RPC messages inside Spark are processed in a shared fixed thread pool, hence we cannot run blocking calls inside an RPC thread.
I think we need to design a backpressure mechanism in the future, fundamentally because a receiver cannot block a sender from sending data. For example, even if we block here, we still cannot prevent the sender from sending data, and it will eventually fill up the TCP buffer. We cannot just count on TCP backpressure here, as we need to use the same TCP connection in order to support thousands of machines.
That's a very strange characteristic for an RPC framework.
I don't know what backpressure could mean other than a receiver blocking a sender from sending more data. In any case, the final shuffle mechanism isn't going to use the RPC framework, so I added a reference to it. (We can discuss in a later PR whether we want to leave this mechanism lying around or remove it once we're confident the TCP-based one is working.)
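The upshot of this thread is roughly the shape below: the blocking put stays as-is and a TODO records the concern. A sketch; the ticket number is a placeholder for whatever JIRA gets filed:

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case r: RPCContinuousShuffleMessage =>
    // TODO(SPARK-XXXXX): this put() can block a shared RPC thread, and can block
    // forever if the reader has failed while its queue is full. Acceptable for the
    // experimental RPC-based shuffle; the planned TCP-based shuffle replaces it.
    queues(r.writerId).put(r)
    // Reply so the sender's askSync completes.
    context.reply(())
}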
/**
 * A [[ContinuousShuffleWriter]] sending data to [[RPCContinuousShuffleReader]] instances.
 *
 * @param writerId The partition ID of this writer.
nit: we don't use vertical alignment, as it introduces unnecessary changes in the future.
endpoints(outputPartitioner.getPartition(row)).askSync[Unit](ReceiverRow(writerId, row))
}

endpoints.foreach(_.askSync[Unit](ReceiverEpochMarker(writerId)))
You can use Future.sequence to send messages in parallel, such as:
import scala.concurrent.Future
import scala.concurrent.duration.Duration
import org.apache.spark.util.ThreadUtils

val futures = endpoints.map(_.ask[Unit](ReceiverEpochMarker(writerId)))
implicit val ec = ThreadUtils.sameThread
ThreadUtils.awaitResult(Future.sequence(futures), Duration.Inf)
Sure, but I don't think there's any benefit to doing so. We need to sequence the messages across epochs too, so there's little parallelization available that way.
As far as I understand, the code here is to send a ReceiverEpochMarker to each endpoint and wait for all of them to respond. You can send the ReceiverEpochMarkers in parallel rather than send and wait one by one.
val endpoint = env.setupEndpoint(s"UnsafeRowReceiver-${UUID.randomUUID()}", receiver)
val receiver = new RPCContinuousShuffleReader(
  queueSize, numShuffleWriters, epochIntervalMs, env)
val endpoint = env.setupEndpoint(s"RPCContinuousShuffleReader-${UUID.randomUUID()}", receiver)
Is it possible to get the query run id here? It would be helpful to debug if the endpoint name contains the query run id and partition id.
It requires a reasonable amount of extra code. As mentioned, this is not the final shuffle mechanism (and I intend to have the TCP-based shuffle ready to go in the next Spark release).
// Once we write the epoch the thread should stop waiting and succeed.
writer.write(Iterator(1))
readRowThread.join()
nit: it's better to add a timeout here, such as readRowThread.join(streamingTimeout.toMillis). Without a timeout, if there is a bug causing this to hang, we will need to wait until the Jenkins build timeout, which is much longer.
}

writers(0).write(Iterator())
readEpochMarkerThread.join()
ditto
private val executor = Executors.newFixedThreadPool(numShuffleWriters)
private val completion = new ExecutorCompletionService[UnsafeRowReceiverMessage](executor)
private val completion = new ExecutorCompletionService[RPCContinuousShuffleMessage](executor)
Are you planning to implement round-robin here? Otherwise, using an array of queues + a thread pool could just be replaced with a single blocking queue.
It cannot be. There's a deadlock scenario where the queue is filled with records from epoch N before all writers have sent the marker for epoch N - 1.
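A rough sketch of the per-writer queue arrangement being defended here (illustrative; queueSize and numShuffleWriters follow the reader's constructor parameters shown elsewhere in the diff). With a single shared queue, epoch-N rows from fast writers could fill the buffer while a slow writer's epoch N-1 marker is still outstanding, deadlocking the reader; per-writer bounded queues plus a completion service avoid that:

import java.util.concurrent.{ArrayBlockingQueue, Callable, ExecutorCompletionService, Executors}

// One bounded queue per writer, so a fast writer cannot crowd out another
// writer's pending epoch marker.
private val queues = Array.fill(numShuffleWriters)(
  new ArrayBlockingQueue[RPCContinuousShuffleMessage](queueSize))

private val executor = Executors.newFixedThreadPool(numShuffleWriters)
private val completion = new ExecutorCompletionService[RPCContinuousShuffleMessage](executor)

// Each submitted task takes one message from a specific writer's queue; the
// reader loops on completion.take() to consume whichever writer is ready next.
(0 until numShuffleWriters).foreach { writerId =>
  completion.submit(new Callable[RPCContinuousShuffleMessage] {
    override def call(): RPCContinuousShuffleMessage = queues(writerId).take()
  })
}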
Test build #91745 has finished for PR 21428 at commit

Test build #91748 has finished for PR 21428 at commit

Test build #91742 has finished for PR 21428 at commit

retest this please

Test build #91772 has finished for PR 21428 at commit

LGTM pending tests

Test build #91781 has finished for PR 21428 at commit

Thanks! Merging to master.
What changes were proposed in this pull request?
https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit
Implement continuous shuffle write RDD for a single reader partition. (I don't believe any implementation changes are actually required for multiple reader partitions, but this PR is already very large, so I want to exclude those for now to keep the size down.)
How was this patch tested?
new unit tests