
Conversation

@tdas (Contributor) commented Jun 28, 2016

Title defines all.

@SparkQA commented Jun 28, 2016

Test build #61380 has finished for PR 13945 at commit 74108e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DeviceData(device: String, type: String, signal: Double, time: DateTime)
  • The writer must do all the initialization (e.g. opening connections, starting a transaction, etc.) only when the `open` method is called. Be aware that, if there is any initialization in the class as soon as the object is created, then that initialization will happen in the driver (because that is where the instance is being created), which may not be what you intend.
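The writer lifecycle described in the last bullet can be sketched in plain Python (a standalone illustration, not Spark's actual writer API; the class and field names here are made up). The point is that the object is constructed on the driver and shipped to executors, so connection setup belongs in `open`, not in the constructor:

```python
# Standalone sketch of the writer lifecycle described above (names are
# illustrative, not Spark's API). The object is constructed on the
# driver and serialized to executors, so anything done in __init__
# happens on the driver; per-partition setup belongs in open().

class SketchWriter:
    def __init__(self):
        # Runs on the driver when the writer is created:
        # keep this free of connections, transactions, etc.
        self.connection = None

    def open(self, version, partition):
        # Runs on an executor, once per (version, partition):
        # this is where initialization belongs.
        self.connection = {"version": version,
                           "partition": partition,
                           "rows": []}
        return True  # True = proceed with writing these rows

    def process(self, row):
        self.connection["rows"].append(row)

    def close(self, error):
        # Commit the transaction / release the connection here.
        self.connection = None

w = SketchWriter()                  # driver side: no connection yet
if w.open(version=1, partition=0):  # executor side: now initialize
    w.process("row-1")
w.close(None)
```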

[Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java or Python to express streaming
aggregations, event-time windows, stream-to-batch joins, etc. The computation
is executed on the same optimized Spark SQL engine. Finally, the system
ensures end-to-end exactly-once fault-tolerance guarantees through
Contributor

End-to-end exactly-once sounds like over-promising. Should probably define what the ends are, because destructive outputs can't be literally exactly-once in the face of network failures.

Contributor Author

ensures --> can ensure


- `version` and `partition` are two parameters in `open` that uniquely represent a set of rows that needs to be pushed out. `version` is a monotonically increasing id that increases with every trigger. `partition` is an id that represents a partition of the output, since the output is distributed and will be processed on multiple executors.

- `open` can use the `version` and `partition` to choose whether it need to write the sequence of rows. Accordingly, it can return `true` (proceed with writing), or `false` (no need to write). If the `false` is returned, then `write` will not be called on any row. For example, after a partial failure, so partitions of the failed trigger may have already been committed to a database. Based on metadata stores in the database, the writer can identify partitions that have already been committed and
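The recovery behaviour in this bullet can be sketched as follows (plain Python, not Spark's API; the `committed` metadata set is a made-up stand-in for whatever store the database provides). `open` returns `False` for `(version, partition)` pairs that were committed before the failure, so their rows are not written twice:

```python
# Standalone sketch of the skip-on-recovery logic described above
# (illustrative only; the metadata store is hypothetical).

committed = {(5, 0), (5, 1)}  # pairs committed before a partial failure

def open_partition(version, partition):
    """Mirrors the writer's open(version, partition): return True
    iff this partition's rows still need to be written."""
    return (version, partition) not in committed

# Re-running trigger (version) 5 over 4 output partitions:
# partitions 0 and 1 were already committed, so only 2 and 3
# are written again.
to_write = [p for p in range(4) if open_partition(5, p)]
```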
Contributor

"whether it need" -> "whether it needs"
"If the false" -> "If false"
"so partitions" -> "some partitions"

"been committed and" ...? the end of this bullet seems to be missing

Contributor

whether it need to write => whether it needs to write

If the false is returned => If false is returned

partitions that have already been committed and => incomplete sentence?

Contributor Author

fixed

@tdas (Contributor Author) commented Jun 29, 2016

Thank you very much everyone for the detailed review! I am really thankful you caught so many issues that I missed in my first pass. I have addressed your comments as well as more comments I have received offline.

In the interest of the Spark 2.0 release, I am going to prioritize merging this PR. If there are outstanding issues, let's solve them in follow-up PRs. I am sure that we can improve this draft by a lot with everyone's contributions.

@SparkQA commented Jun 29, 2016

Test build #61455 has finished for PR 13945 at commit 78223a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Now consider what happens if one of the events arrives late to the application.
For example, a word that was generated at 12:04 but it was received at 12:11.
Since this windowing is based on the time in the data, the time 12:04 should considered for windowing. This occurs naturally in our window-based grouping --the late data is automatically placed in the proper windows and the correct aggregates updated as illustrated below.
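The late-data behaviour described here can be sketched in plain Python (a standalone illustration of event-time windowing, not Spark's `window` function; the 10-minute window with a 5-minute slide is assumed from the surrounding example):

```python
# Standalone sketch of event-time windowing with late data (not
# Spark's API). Times are minutes since midnight; windows are
# 10 minutes long, sliding every 5 minutes (assumed from the example).
from collections import Counter

def windows(event_min, size=10, slide=5):
    """All (start, end) windows that contain the given event time."""
    start = (event_min // slide) * slide  # latest aligned start <= t
    out = []
    while start > event_min - size:
        out.append((start, start + size))
        start -= slide
    return out

counts = Counter()

def receive(word, event_min):
    # Grouping uses the time in the data, not the arrival time, so a
    # late record still lands in its original windows.
    for w in windows(event_min):
        counts[(w, word)] += 1

# A word generated at 12:04 (minute 724) arrives late at 12:11: it is
# still counted in the 12:00-12:10 and 11:55-12:05 windows.
receive("cat", 12 * 60 + 4)
```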
@ScrapCodes (Member) commented Jun 29, 2016

Couple of minor corrections.

  1. the time 12:04, should be considered for windowing.
  2. grouping - the late data

@tdas (Contributor Author) commented Jun 29, 2016

@ScrapCodes thanks for catching those. I will update them in a follow-up PR. I am merging this as-is to master and 2.0 in the interest of making it to Spark 2.0 RC2.

asfgit pushed a commit that referenced this pull request Jun 29, 2016
…Guide

Title defines all.

Author: Tathagata Das <[email protected]>

Closes #13945 from tdas/SPARK-16256.

(cherry picked from commit 64132a1)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit closed this in 64132a1 on Jun 29, 2016
@tdas (Contributor Author) commented Jun 29, 2016

I have opened another PR, #13978, with the leftover fixes.



7 participants