[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide #13945
Conversation
Test build #61380 has finished for PR 13945 at commit
> [Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java or Python to express streaming
> aggregations, event-time windows, stream-to-batch joins, etc. The computation
> is executed on the same optimized Spark SQL engine. Finally, the system
> ensures end-to-end exactly-once fault-tolerance guarantees through
End-to-end exactly-once sounds like over-promising. Should probably define what the ends are, because destructive outputs can't be literally exactly-once in the face of network failures.
ensures --> can ensure
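For context, the quoted passage is describing the high-level API, and a streaming aggregation really is expressed with the ordinary Dataset/DataFrame operators. Below is a minimal sketch in Scala; the socket source, host, and port are illustrative choices, not part of the guide text under review:

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .getOrCreate()
    import spark.implicits._

    // Read a stream of lines from a socket source (host/port illustrative)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Express the computation with the usual Dataset/DataFrame API:
    // split each line into words and keep a running count per word
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Emit the full updated counts to the console after every trigger
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The same logical plan runs on the Spark SQL engine whether the input is bounded or streaming, which is the point the quoted paragraph is making.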
> - `version` and `partition` are two parameter in the `open` that uniquely represents a set of rows that needs to be pushed out. `version` is monotonically increasing id that increases with every trigger. `partition` is an id that represents a partition of the output, since the output is distributed and will be processed on multiple executors.
> - `open` can use the `version` and `partition` to choose whether it need to write the sequence of rows. Accordingly, it can return `true` (proceed with writing), or `false` (no need to write). If the `false` is returned, then `write` will not be called on any row. For example, after a partial failure, so partitions of the failed trigger may have already been committed to a database. Based on metadata stores in the database, the writer can identify partitions that have already been committed and
"whether it need" -> "whether it needs"
"If the false" -> "If false"
"so partitions" -> "some partitions"
"been committed and" ...? the end of this bullet seems to be missing
whether it need to write => whether it needs to write
If the false is returned => If false is returned
partitions that have already been committed and => incomplete sentence?
fixed
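To make the `open(version, partition)` contract concrete, here is a minimal sketch of a `ForeachWriter` that uses the pair to skip already-committed output. `alreadyCommitted` and `commitAtomically` are hypothetical stand-ins for whatever metadata bookkeeping the target database actually provides:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.ForeachWriter

class IdempotentWriter extends ForeachWriter[String] {
  private val buffered = ArrayBuffer.empty[String]
  private var curVersion: Long = -1L
  private var curPartition: Long = -1L

  // Hypothetical metadata-store calls: a real sink would query its own
  // commit log, e.g. a table keyed by (version, partition).
  private def alreadyCommitted(version: Long, partition: Long): Boolean = false
  private def commitAtomically(version: Long, partition: Long,
                               rows: Seq[String]): Unit = ()

  override def open(partitionId: Long, version: Long): Boolean = {
    curVersion = version
    curPartition = partitionId
    // Returning false tells Spark to skip every row of this
    // (version, partition) pair, e.g. because an attempt before a
    // failure already committed it.
    !alreadyCommitted(version, partitionId)
  }

  override def process(value: String): Unit = {
    // Buffer the row; the actual write happens in close()
    buffered += value
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      // Write the buffered rows and record (version, partition)
      // as committed in one atomic step
      commitAtomically(curVersion, curPartition, buffered)
    }
    buffered.clear()
  }
}
```

Returning `false` from `open` is what lets a sink recover idempotently from the partial-failure scenario the quoted bullet describes.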
Thank you very much everyone for the detailed review! I am really thankful you caught so many issues that I missed in my first pass. I have addressed your comments as well as more comments I have received offline. In the interest of the Spark 2.0 release, I am going to prioritize merging this PR. If there are outstanding issues, let's solve them in follow-up PRs. I am sure that we can improve this draft by a lot with everyone's contributions.
Test build #61455 has finished for PR 13945 at commit
> Now consider what happens if one of the events arrives late to the application.
> For example, a word that was generated at 12:04 but it was received at 12:11.
> Since this windowing is based on the time in the data, the time 12:04 should considered for windowing. This occurs naturally in our window-based grouping --the late data is automatically placed in the proper windows and the correct aggregates updated as illustrated below.
A couple of minor corrections:
- the time 12:04, should be considered for windowing.
- grouping - the late data
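For reference, the window-based grouping this hunk refers to looks roughly like the following sketch; the `words` DataFrame and its `timestamp`/`word` columns are assumed here for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// `words` is assumed to be a streaming DataFrame with an event-time
// column `timestamp` and a string column `word`.
def windowedCounts(words: DataFrame): DataFrame =
  words
    .groupBy(
      // 10-minute windows sliding every 5 minutes, keyed on the
      // timestamp embedded in the data, not the arrival time
      window(col("timestamp"), "10 minutes", "5 minutes"),
      col("word"))
    .count()
```

Because the grouping key is derived from the data's own timestamp, a word generated at 12:04 but received at 12:11 is still counted in the windows containing 12:04 (e.g. 12:00-12:10), which is the behavior the quoted paragraph describes.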
@ScrapCodes thanks for catching those. I will update them in a follow-up PR. I am merging this as-is to master and 2.0 in the interest of making it to Spark 2.0 RC2.
[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <[email protected]>

Closes #13945 from tdas/SPARK-16256.

(cherry picked from commit 64132a1)
Signed-off-by: Tathagata Das <[email protected]>
I have opened up another PR #13978 with leftover fixes.