Conversation

@jose-torres (Contributor) commented Aug 17, 2017

What changes were proposed in this pull request?

All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class argument in a few cases, since a node might be logically streaming depending on where it came from.

How was this patch tested?

Existing unit tests; no functional change is intended in this PR.
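
For context, a minimal sketch of the pattern this PR applies (simplified from catalyst's LogicalPlan; ExampleLeaf is a hypothetical stand-in for leaves like LocalRelation and Range):

```scala
// Sketch only, not the actual Spark source.
abstract class LogicalPlan {
  def children: Seq[LogicalPlan]

  // True if this subtree's data comes from a streaming source. Non-leaf
  // nodes derive the flag from their children; leaves override it.
  def isStreaming: Boolean = children.exists(_.isStreaming)
}

// A leaf that may be logically streaming depending on where it came from
// carries the flag as a case class argument. ExampleLeaf is hypothetical.
case class ExampleLeaf(
    output: Seq[String],
    override val isStreaming: Boolean = false) extends LogicalPlan {
  override def children: Seq[LogicalPlan] = Nil
}
```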

@SparkQA commented Aug 17, 2017

Test build #80794 has finished for PR 18973 at commit 86a3de9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 17, 2017

Test build #80796 has finished for PR 18973 at commit 3b7eb80.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Can you add docs to explain what isStreaming is?

Contributor Author

Done. (I think this is a correct summary?)

Contributor

Make sure this is the same as the updated isStreaming docs (see my other comments).

Contributor

Rather than change this, just use the 3-param version of LocalRelation.
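
The call in question would presumably become something like this (a sketch; the third, defaulted parameter of LocalRelation is the isStreaming flag this PR adds):

```scala
// Pass the flag through LocalRelation's existing parameter list rather than
// changing the node.
LocalRelation(output, data, isStreaming = true)
```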

Contributor Author

Done.

Contributor

Just isStreaming is fine; isStreaming = isStreaming is overkill. It's only useful when the value is a constant, e.g. isStreaming = true.

Contributor Author

It's necessary here because there are two other default arguments in the constructor.
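
A minimal illustration of that constraint with a hypothetical signature (not the actual Spark constructor): once the defaulted parameters before it are skipped, the final one can only be passed by name.

```scala
// Hypothetical case class with several defaulted parameters.
case class Example(
    name: String,
    numSlices: Option[Int] = None,
    hints: Seq[String] = Nil,
    isStreaming: Boolean = false)

val isStreaming = true
// Passing the flag positionally would mean spelling out numSlices and hints
// first, so the named form is required even though it repeats the name:
val e = Example("range", isStreaming = isStreaming)
```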

@SparkQA commented Aug 17, 2017

Test build #80798 has finished for PR 18973 at commit 6e1dd50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 18, 2017

Test build #80811 has finished for PR 18973 at commit e5e962b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 19, 2017

Test build #80864 has finished for PR 18973 at commit 28c2f4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DebugForeachWriter[A : Encoder]() extends ForeachWriter[A]

@SparkQA commented Aug 19, 2017

Test build #80863 has finished for PR 18973 at commit 60a3586.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class DebugForeachWriter[A : Encoder]() extends ForeachWriter[A]

@SparkQA commented Aug 19, 2017

Test build #80867 has finished for PR 18973 at commit ac7d785.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Aug 21, 2017

test this please.

@tdas (Contributor) left a comment

Can you update the Scala docs on LogicalPlan.isStreaming to say that isStreaming means the plan contains data from a streaming source (i.e. the plan need not itself contain a streaming source)?

Accordingly, update the other comments defined on isStreaming in the leaves.

@tdas (Contributor) commented Aug 21, 2017

We should not require DropDuplicates( ... isStreaming) anymore. Can you remove it?

Jose Torres added 2 commits August 21, 2017 15:02
@jose-torres (Contributor Author)

Addressed comments from @tdas


Diff context on the memory source's getBatch logging (excerpt; the new logDebug({ block is truncated in the original):

```scala
logDebug(
  s"MemoryBatch [$startOrdinal, $endOrdinal]: ${newBlocks.flatMap(_.collect()).mkString(", ")}")
logDebug({
```
Contributor

Please make this a separate function. It's weird to have so much code inside logDebug
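
A sketch of the suggested refactor with assumed names and types (not the actual patch): hoist the message construction into a small helper so the logDebug call stays a one-liner.

```scala
// Hypothetical helper; A is the stream's element type and newBlocks holds
// the Datasets backing the requested batch range.
private def describeBatch(
    startOrdinal: Int,
    endOrdinal: Int,
    newBlocks: Seq[Dataset[A]]): String = {
  s"MemoryBatch [$startOrdinal, $endOrdinal]: " +
    newBlocks.flatMap(_.collect()).mkString(", ")
}

logDebug(describeBatch(startOrdinal, endOrdinal, newBlocks))
```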

Contributor

Actually, this does not need to be so complicated. See how I have disabled the UnsupportedOperationChecker to do a collect() in FileStreamSourceSuite.
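
A hedged sketch of that pattern: the withSQLConf helper (from SQLTestUtils) and the internal spark.sql.streaming.unsupportedOperationCheck flag both exist in this era of Spark, but the test body around them is illustrative.

```scala
// Temporarily disable the streaming unsupported-operation check so that
// collect() on a plan tagged isStreaming = true is allowed inside a test.
withSQLConf("spark.sql.streaming.unsupportedOperationCheck" -> "false") {
  val batch = source.getBatch(None, LongOffset(10)) // `source`: a test source
  assert(batch.collect().length === 11)
}
```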

Contributor Author

Done.

Diff context in the test source's getBatch (the spark.range line is being replaced by a new Dataset construction, which is truncated in the original):

```scala
override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
  val startOffset = start.map(_.asInstanceOf[LongOffset].offset).getOrElse(-1L) + 1
  spark.range(startOffset, end.asInstanceOf[LongOffset].offset + 1).toDF("a")
  val ds = new Dataset[java.lang.Long](
```
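
The truncated construction plausibly continues along these lines (a reconstruction, not a verbatim excerpt; the trailing isStreaming flag on Range is what this PR adds):

```scala
override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
  val startOffset = start.map(_.asInstanceOf[LongOffset].offset).getOrElse(-1L) + 1
  // Build the Dataset over a Range plan with isStreaming forced to true, so
  // the returned DataFrame is recognized as streaming data.
  val ds = new Dataset[java.lang.Long](
    spark.sparkSession,
    Range(startOffset, end.asInstanceOf[LongOffset].offset + 1, 1, None,
      isStreaming = true),
    Encoders.LONG)
  ds.toDF("a")
}
```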
Contributor

Can't you use internalCreateDataFrame here?
Also add a comment about the fact that you are trying to ensure isStreaming is true.

Contributor

You don't even need the Range logical plan. Since it's for debugging, you can directly create a DF from a local seq of startOffset to endOffset.

Contributor Author

I've tried addressing this a few different ways, and I can't come up with anything cleaner than the current solution. Directly creating a DF doesn't set the isStreaming bit, and a bunch of copying and casting is required to get it set; using LocalRelation requires explicitly handling the encoding of the rows, since LocalRelation requires InternalRow input.

Diff context (excerpt; the one-line signature is being replaced by a multiline one, truncated in the original):

```scala
private[sql]
def internalCreateDataFrame(catalystRows: RDD[InternalRow], schema: StructType) = {
  sparkSession.internalCreateDataFrame(catalystRows, schema)
def internalCreateDataFrame(catalystRows: RDD[InternalRow],
```
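
For reference, the post-change signature presumably reads as follows (a reconstruction based on the surrounding discussion, with the isStreaming default assumed; it also uses the multiline style requested in the nit below):

```scala
private[sql] def internalCreateDataFrame(
    catalystRows: RDD[InternalRow],
    schema: StructType,
    isStreaming: Boolean = false): DataFrame
```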
Contributor

nit: The correct code style for a multiline param definition is

```scala
def function(
    param1: type1,      // double indent, i.e. 4 spaces
    param2: type2)
```

See the indentation section in http://spark.apache.org/contributing.html

Contributor Author

Done.

@SparkQA commented Aug 21, 2017

Test build #80939 has finished for PR 18973 at commit ac7d785.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 22, 2017

Test build #80942 has finished for PR 18973 at commit e55abe6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 22, 2017

Test build #80943 has finished for PR 18973 at commit c837069.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 22, 2017

Test build #80952 has finished for PR 18973 at commit 8857cf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Diff context (the new test, truncated in the original):

```scala
  assert(progress.sources(0).numInputRows === 10)
}

test("[SPARK-19690] stream join with aggregate batch query succeeds") {
```
Contributor

Can you move this to StreamingAggregationSuite? That suite is closely related to this aggregation bug. And I would rename it to "SPARK-19690: do not convert batch aggregation in streaming query to streaming aggregation".

@tdas (Contributor) commented Aug 22, 2017

Also, I would actually test whether the output is correct or not. See other tests in StreamingAggregationSuite
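
A hedged sketch of the requested shape, using Spark's StreamTest DSL (testStream, AddData, and CheckLastBatch are real harness helpers; the data, join condition, and expected rows are illustrative rather than the merged test, and org.apache.spark.sql.functions._ is assumed in scope):

```scala
test("SPARK-19690: do not convert batch aggregation in streaming query " +
    "to streaming aggregation") {
  import testImplicits._
  val streamInput = MemoryStream[Int]
  // A *batch* aggregate joined into a streaming query: the fix ensures this
  // aggregate is not re-planned as a streaming aggregation.
  val batchDF = Seq(1, 2, 3, 4, 5)
    .toDF("value")
    .groupBy($"value" % 2 as "parity")
    .agg(count("*") as "cnt")
  val joined = streamInput.toDF().join(batchDF, $"value" === $"parity")

  testStream(joined)(
    AddData(streamInput, 0, 1),
    // parity 0 covers batch values 2 and 4; parity 1 covers 1, 3, and 5.
    CheckLastBatch((0, 0, 2L), (1, 1, 3L)))
}
```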

@tdas
Copy link
Contributor

tdas commented Aug 22, 2017

One comment regarding location of the aggregation test. Other than that LGTM.

@tdas
Copy link
Contributor

tdas commented Aug 22, 2017

LGTM pending tests.

@SparkQA commented Aug 23, 2017

Test build #81008 has finished for PR 18973 at commit fd725bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2017

Test build #81010 has finished for PR 18973 at commit 8fd9053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Aug 23, 2017

Merging this to master. Thank you @jose-torres!

@asfgit asfgit closed this in 3c0c2d0 Aug 23, 2017
Diff context on the Range case class parameters (the old output line is being replaced by the two new lines adding the flag):

```scala
    numSlices: Option[Int],
    output: Seq[Attribute])
    output: Seq[Attribute],
    override val isStreaming: Boolean)
```
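
For context, the full parameter list after this change plausibly reads as below (a reconstruction of the 2.2-era catalyst Range node, not a verbatim excerpt):

```scala
case class Range(
    start: Long,
    end: Long,
    step: Long,
    numSlices: Option[Int],
    output: Seq[Attribute],
    override val isStreaming: Boolean)
  extends LeafNode with MultiInstanceRelation
```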
Contributor

How can a Range have data from a streaming source?

Contributor Author

I don't think there's necessarily a reason it shouldn't be able to; streaming sources are free to define getBatch() however they'd like.

Right now the only source actually doing that is a fake source in StreamSuite.

@jose-torres jose-torres deleted the SPARK-21765 branch September 15, 2017 08:10