
Conversation

@tdas (Contributor) commented Jun 28, 2016

Title defines all.

@SparkQA commented Jun 28, 2016

Test build #61380 has finished for PR 13945 at commit 74108e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DeviceData(device: String, type: String, signal: Double, time: DateTime)
  • The writer must do all the initialization (e.g. opening connections, starting a transaction, etc.) only when the `open` method is called. Be aware that, if there is any initialization in the class as soon as the object is created, then that initialization will happen in the driver (because that is where the instance is being created), which may not be what you intend.
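The writer lifecycle described in the last bullet can be sketched in plain Python (a standalone illustration, not Spark's actual writer API; the class and field names here are made up). The point is that the object is constructed on the driver and shipped to executors, so connection setup belongs in `open`, not in the constructor:

```python
# Standalone sketch of the writer lifecycle described above (names are
# illustrative, not Spark's API). The object is constructed on the
# driver and serialized to executors, so anything done in __init__
# happens on the driver; per-partition setup belongs in open().

class SketchWriter:
    def __init__(self):
        # Runs on the driver when the writer is created:
        # keep this free of connections, transactions, etc.
        self.connection = None

    def open(self, version, partition):
        # Runs on an executor, once per (version, partition):
        # this is where initialization belongs.
        self.connection = {"version": version,
                           "partition": partition,
                           "rows": []}
        return True  # True = proceed with writing these rows

    def process(self, row):
        self.connection["rows"].append(row)

    def close(self, error):
        # Commit the transaction / release the connection here.
        self.connection = None

w = SketchWriter()                  # driver side: no connection yet
if w.open(version=1, partition=0):  # executor side: now initialize
    w.process("row-1")
w.close(None)
```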

[Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java or Python to express streaming
aggregations, event-time windows, stream-to-batch joins, etc. The computation
is executed on the same optimized Spark SQL engine. Finally, the system
ensures end-to-end exactly-once fault-tolerance guarantees through
Contributor

End-to-end exactly-once sounds like over-promising. Should probably define what the ends are, because destructive outputs can't be literally exactly-once in the face of network failures.

Contributor Author

ensures --> can ensure


- `version` and `partition` are two parameters in `open` that uniquely represent a set of rows that needs to be pushed out. `version` is a monotonically increasing id that increases with every trigger. `partition` is an id that represents a partition of the output, since the output is distributed and will be processed on multiple executors.

- `open` can use the `version` and `partition` to choose whether it need to write the sequence of rows. Accordingly, it can return `true` (proceed with writing), or `false` (no need to write). If the `false` is returned, then `write` will not be called on any row. For example, after a partial failure, so partitions of the failed trigger may have already been committed to a database. Based on metadata stores in the database, the writer can identify partitions that have already been committed and
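The recovery behaviour in this bullet can be sketched as follows (plain Python, not Spark's API; the `committed` metadata set is a made-up stand-in for whatever store the database provides). `open` returns `False` for `(version, partition)` pairs that were committed before the failure, so their rows are not written twice:

```python
# Standalone sketch of the skip-on-recovery logic described above
# (illustrative only; the metadata store is hypothetical).

committed = {(5, 0), (5, 1)}  # pairs committed before a partial failure

def open_partition(version, partition):
    """Mirrors the writer's open(version, partition): return True
    iff this partition's rows still need to be written."""
    return (version, partition) not in committed

# Re-running trigger (version) 5 over 4 output partitions:
# partitions 0 and 1 were already committed, so only 2 and 3
# are written again.
to_write = [p for p in range(4) if open_partition(5, p)]
```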
Contributor

"whether it need" -> "whether it needs"
"If the false" -> "If false"
"so partitions" -> "some partitions"

"been committed and" ...? the end of this bullet seems to be missing

Contributor

whether it need to write => whether it needs to write

If the false is returned => If false is returned

partitions that have already been committed and => incomplete sentence?

Contributor Author

fixed

@tdas (Contributor Author) commented Jun 29, 2016

Thank you very much everyone for the detailed review! I am really thankful you caught so many issues that I missed in my first pass. I have addressed your comments as well as more comments I have received offline.

In the interest of the Spark 2.0 release, I am going to prioritize merging this PR. If there are outstanding issues, let's solve them in follow-up PRs. I am sure that we can improve this draft by a lot with everyone's contributions.

@SparkQA commented Jun 29, 2016

Test build #61455 has finished for PR 13945 at commit 78223a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Now consider what happens if one of the events arrives late to the application.
For example, a word that was generated at 12:04 but it was received at 12:11.
Since this windowing is based on the time in the data, the time 12:04 should considered for windowing. This occurs naturally in our window-based grouping --the late data is automatically placed in the proper windows and the correct aggregates updated as illustrated below.
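The late-data behaviour described here can be sketched in plain Python (a standalone illustration of event-time windowing, not Spark's `window` function; the 10-minute window with a 5-minute slide is assumed from the surrounding example):

```python
# Standalone sketch of event-time windowing with late data (not
# Spark's API). Times are minutes since midnight; windows are
# 10 minutes long, sliding every 5 minutes (assumed from the example).
from collections import Counter

def windows(event_min, size=10, slide=5):
    """All (start, end) windows that contain the given event time."""
    start = (event_min // slide) * slide  # latest aligned start <= t
    out = []
    while start > event_min - size:
        out.append((start, start + size))
        start -= slide
    return out

counts = Counter()

def receive(word, event_min):
    # Grouping uses the time in the data, not the arrival time, so a
    # late record still lands in its original windows.
    for w in windows(event_min):
        counts[(w, word)] += 1

# A word generated at 12:04 (minute 724) arrives late at 12:11: it is
# still counted in the 12:00-12:10 and 11:55-12:05 windows.
receive("cat", 12 * 60 + 4)
```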
@ScrapCodes (Member) commented Jun 29, 2016

Couple of minor corrections.

  1. the time 12:04, should be considered for windowing.
  2. grouping - the late data

@tdas (Contributor Author) commented Jun 29, 2016

@ScrapCodes thanks for catching those. I will update them in a follow-up PR. I am merging this as-is to master and 2.0 in the interest of making it to Spark 2.0 RC2.

asfgit pushed a commit that referenced this pull request Jun 29, 2016
…Guide

Title defines all.

Author: Tathagata Das <[email protected]>

Closes #13945 from tdas/SPARK-16256.

(cherry picked from commit 64132a1)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit closed this in 64132a1 on Jun 29, 2016
@tdas (Contributor Author) commented Jun 29, 2016

I have opened another PR, #13978, with the leftover fixes.



7 participants