[SPARK-21765] Set isStreaming on leaf nodes for streaming plans. #18973
Conversation
Test build #80794 has finished for PR 18973 at commit
Test build #80796 has finished for PR 18973 at commit
Can you add docs to explain what isStreaming is?
Done. (I think this is a correct summary?)
Make sure this is same as the updated isStreaming docs (see my other comments)
Rather than change this, just use the 3-param version of LocalRelation.
Done.
Just isStreaming is fine; isStreaming = isStreaming is overkill. It's only useful when the value is a constant, e.g. isStreaming = true.
It's necessary here because there are two other default arguments in the constructor.
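For readers following along, a minimal standalone sketch of why the named argument is required here; the Relation case class below is hypothetical, not Spark's actual LocalRelation signature:

```scala
// Hypothetical case class; not Spark's actual LocalRelation signature.
case class Relation(
    output: Seq[String],
    data: Seq[Int] = Seq.empty,
    isStreaming: Boolean = false)

object NamedArgExample extends App {
  val isStreaming = true
  // Relation(Seq("a"), isStreaming) would not compile: positionally, the
  // Boolean would bind to the defaulted `data` parameter. Naming the
  // argument lets the call skip over the other defaults.
  println(Relation(Seq("a"), isStreaming = isStreaming))
}
```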
Test build #80798 has finished for PR 18973 at commit
Test build #80811 has finished for PR 18973 at commit
Force-pushed from 60a3586 to 28c2f4b.
Test build #80864 has finished for PR 18973 at commit
Test build #80863 has finished for PR 18973 at commit
Test build #80867 has finished for PR 18973 at commit
test this please.
tdas left a comment:
Can you update the Scala docs on LogicalPlan.isStreaming to say that isStreaming means the plan has data from a streaming source (i.e. it need not itself contain a streaming source)?
Accordingly, update the other isStreaming comments defined on the leaf nodes.
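A sketch of what the requested wording could look like, using a simplified stand-in for LogicalPlan (the merged ScalaDoc text may differ):

```scala
// Simplified stand-in; the merged ScalaDoc wording may differ.
abstract class LogicalPlanSketch {
  /**
   * Returns true if this subtree contains data that arrived from a streaming
   * source. Note that a plan can be streaming without itself containing a
   * streaming source node, e.g. a DataFrame returned by Source.getBatch.
   */
  def isStreaming: Boolean
}
```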
We should not require
It's now redundant with LogicalPlan.isStreaming.
Addressed comments from @tdas
    logDebug(
      s"MemoryBatch [$startOrdinal, $endOrdinal]: ${newBlocks.flatMap(_.collect()).mkString(", ")}")
    logDebug({
Please make this a separate function. It's weird to have so much code inside logDebug.
Actually, this does not need to be so complicated. See how I have disabled UnsupportedOperationChecker to do a collect() in FileStreamSourceSuite.
Done.
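For reference, a sketch of the reviewer's first suggestion (hoisting the message construction into a named helper); the object and method names are hypothetical, and the row type is simplified to String:

```scala
object MemoryStreamDebug {
  // Hypothetical helper: builds the message outside logDebug so the logging
  // call stays a one-liner. Row type simplified to String for the sketch.
  def batchDebugString(startOrdinal: Int, endOrdinal: Int, rows: Seq[String]): String =
    s"MemoryBatch [$startOrdinal, $endOrdinal]: ${rows.mkString(", ")}"
}
```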
    override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
      val startOffset = start.map(_.asInstanceOf[LongOffset].offset).getOrElse(-1L) + 1
      spark.range(startOffset, end.asInstanceOf[LongOffset].offset + 1).toDF("a")
      val ds = new Dataset[java.lang.Long](
Can't you use createInternalDataFrame out here?
Also, add a comment about the fact that you are trying to ensure isStreaming is true.
You don't even need the Range logical plan. Since it's for debugging, you can directly create a DF from a local seq of startOffset to endOffset.
I've tried addressing this a few different ways, and I can't come up with anything cleaner than the current solution. Directly creating a DF doesn't set the isStreaming bit, and a bunch of copying and casting is required to get it set; using LocalRelation requires explicitly handling the encoding of the rows, since LocalRelation requires InternalRow input.
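A sketch of the resulting approach, assuming the three-argument internalCreateDataFrame this PR introduces (note it is private[sql], so this only compiles from code inside the org.apache.spark.sql package):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object StreamingRangeBatch {
  // Builds a DataFrame over [startOffset, endOffset] whose plan reports
  // isStreaming = true. The public createDataFrame paths always produce a
  // batch (isStreaming = false) plan, hence the internal API.
  def apply(spark: SparkSession, startOffset: Long, endOffset: Long): DataFrame = {
    val schema = StructType(StructField("a", LongType) :: Nil)
    val rows: RDD[InternalRow] =
      spark.sparkContext.range(startOffset, endOffset + 1).map(l => InternalRow(l))
    spark.internalCreateDataFrame(rows, schema, isStreaming = true)
  }
}
```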
    private[sql]
    def internalCreateDataFrame(catalystRows: RDD[InternalRow], schema: StructType) = {
      sparkSession.internalCreateDataFrame(catalystRows, schema)
    def internalCreateDataFrame(catalystRows: RDD[InternalRow],
nit: The correct code style for a multiline param definition is

    def function(
        param1: type1,  // double indent, i.e. 4 spaces
        param2: type2)

See the indentation section in http://spark.apache.org/contributing.html
Done.
Test build #80939 has finished for PR 18973 at commit
Test build #80942 has finished for PR 18973 at commit
Test build #80943 has finished for PR 18973 at commit
Test build #80952 has finished for PR 18973 at commit
    assert(progress.sources(0).numInputRows === 10)
    }

    test("[SPARK-19690] stream join with aggregate batch query succeeds") {
Can you move this to StreamingAggregationSuite? That suite is closely related to this aggregation bug. I would also rename it to "SPARK-19690: do not convert batch aggregation in streaming query to streaming aggregation".
Also, I would actually test whether the output is correct or not. See the other tests in StreamingAggregationSuite.
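A rough sketch of what such a test could look like. It assumes the StreamTest harness (MemoryStream, testStream, AddData, CheckLastBatch) plus spark.implicits._ and functions.count, and uses illustrative data; it is not the exact test that was merged:

```scala
test("SPARK-19690: do not convert batch aggregation in streaming query " +
    "to streaming aggregation") {
  val input = MemoryStream[Int]
  // Batch-side aggregate over static data: counts per parity (0 or 1).
  val batchAgg = Seq(1, 2, 3, 4, 5).toDF("value")
    .withColumn("parity", $"value" % 2)
    .groupBy($"parity")
    .agg(count("*").as("cnt"))
  val joined = input.toDF().join(batchAgg, $"value" === $"parity")
  testStream(joined)(
    AddData(input, 0, 1),
    // 0 joins parity 0 (two even values); 1 joins parity 1 (three odd values).
    CheckLastBatch((0, 0, 2L), (1, 1, 3L)))
}
```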
One comment regarding location of the aggregation test. Other than that LGTM.
LGTM pending tests.
Test build #81008 has finished for PR 18973 at commit
|
|
Test build #81010 has finished for PR 18973 at commit
|
|
Merging this to master. Thank you @Joseph-Torres ! |
    numSlices: Option[Int],
    output: Seq[Attribute])
    output: Seq[Attribute],
    override val isStreaming: Boolean)
How can a Range have data from a streaming source?
I don't think there's necessarily a reason it shouldn't be able to; streaming sources are free to define getBatch() however they'd like.
Right now the only source actually doing that is a fake source in StreamSuite.
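A simplified sketch of such a fake source, with offsets reduced to plain Longs instead of Spark's Offset type (the real one in StreamSuite implements the Source trait):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Offsets reduced to plain Longs instead of Spark's Offset type.
class FakeRangeSource(spark: SparkSession) {
  // Answers getBatch with a Range-based DataFrame; on the streaming path this
  // plan must still report isStreaming = true, even though its leaf is Range.
  def getBatch(start: Option[Long], end: Long): DataFrame = {
    val startOffset = start.getOrElse(-1L) + 1
    spark.range(startOffset, end + 1).toDF("a")
  }
}
```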
What changes were proposed in this pull request?
All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from.
How was this patch tested?
Existing unit tests - no functional change is intended in this PR.
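To illustrate the shape of the change, a simplified stand-in (not Spark's actual classes) showing isStreaming as a constructor argument on a leaf node:

```scala
trait LeafNodeSketch {
  // True if this node's data comes from a streaming source.
  def isStreaming: Boolean = false
}

// The flag becomes a constructor argument because the same relation shape can
// be batch or streaming depending on where it came from (e.g. Source.getBatch).
case class LocalRelationSketch(
    output: Seq[String],
    data: Seq[Seq[Any]] = Nil,
    override val isStreaming: Boolean = false) extends LeafNodeSketch
```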