
Conversation

@marmbrus (Owner) commented Jan 5, 2016

This PR adds the initial infrastructure for specifying streaming Sources, Sinks and executing queries as new data arrives. Additionally, it adds a test framework that can be used to test Sources and execution. The goal here is to get an initial version of the API committed to a shared branch so that we can continue to iterate on the various APIs.
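For orientation, here is a rough Scala sketch of the abstractions under discussion, reconstructed from the review comments below. The exact signatures in this draft may differ, and addBatch in particular is an assumption:

```scala
import org.apache.spark.sql.DataFrame

// Offsets are source-defined positions in a stream; see the serialization
// discussion below.
trait Offset extends Serializable

// A Source produces data as new records arrive.
trait Source {
  // The latest offset for which data is available (the getOffset naming
  // debate below refers to this method).
  def offset: Offset

  // The data that has arrived since `start` (None meaning "from the
  // beginning"); this method is mentioned by name in the review below.
  def getNextBatch(start: Option[Offset]): DataFrame
}

// A Sink consumes batches and records how far the stream has progressed.
trait Sink {
  // The offsets already committed to this sink, if any; the return type
  // is debated below.
  def currentProgress: Option[Offset]

  // Assumed signature: persist a batch along with the new progress.
  def addBatch(progress: Map[Source, Offset], data: DataFrame): Unit
}
```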

@marmbrus changed the title from "[WIP] Initial draft of streaming Dataframe infrastructure" to "Initial draft of streaming Dataframe infrastructure" on Jan 5, 2016
@marmbrus changed the title from "Initial draft of streaming Dataframe infrastructure" to "Initial draft of Streaming Dataframe infrastructure" on Jan 5, 2016
Reviewer:

nit: volatile and private[streaming]?

Reviewer:

Can the user reuse a Source? E.g., inputData1.toDS().union(inputData1.toDS())

@marmbrus (Owner Author):

Yes, that works already.

Collaborator:

Why are basic streaming abstractions like Offset in the sql.execution package? From a user's point of view, it's not intuitive to have "execution" in the middle, and it's painful to import that deep a package name.

I think basic abstractions like Offset, Source, Sink should be in sql.streaming, and things like StreamExecution can be in sql.execution.streaming or sql.streaming.execution.
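Sketched as imports, the difference the comment is pointing at (the current package name is inferred from the complaint above, so treat it as an assumption):

```scala
// What users would write under the proposed layout:
import org.apache.spark.sql.streaming.{Offset, Sink, Source}
import org.apache.spark.sql.execution.streaming.StreamExecution

// versus importing the basic abstractions from the execution package:
// import org.apache.spark.sql.execution.streaming.{Offset, Sink, Source}
```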

@marmbrus (Owner Author):

My thought was to put everything in execution by default, since it is hidden from the Scaladoc and so on. We can move things out as we decide to make them public API.

Collaborator:

I am cool with that for now.

@tdas (Collaborator) commented Jan 6, 2016

High-level question: should I care about class scopes and such right now?
At the least, I want to understand the package names.

Collaborator:

Therefore, it

@AmplabJenkins:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/10/
Test FAILed.

@JoshRosen (Collaborator):

Jenkins, retest this please.

@AmplabJenkins:

Merged build finished. Test FAILed.

@AmplabJenkins:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/11/
Test FAILed.

@JoshRosen (Collaborator):

Wait, I can fix this myself! I'll push a fix now.

@AmplabJenkins:

Merged build finished. Test PASSed.

@AmplabJenkins:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/12/
Test PASSed.

Collaborator:

But it's not just about serialization. StreamProgress is a map of Source -> Offset, and only the offsets should be serialized and deserialized, not the sources. At recovery time, if the sink has deserialized only the offsets, how will it reconstruct a StreamProgress (the map of Source -> Offset) from them?

There needs to be a separate class that is just a sequence of Offsets, which is what gets passed on to the Sink, as sketched below. The ordering of the offsets within the sequence would be deterministic, so that it is preserved after recovery; as long as the sources can also be ordered deterministically, the recovered offsets can be matched back to them.
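A sketch of that idea, reusing the traits sketched in the PR description above (the name CompositeOffset is borrowed from later in this thread, and the re-pairing helper is illustrative only):

```scala
// An ordered sequence of offsets with no reference to the sources
// themselves; only this gets serialized.
case class CompositeOffset(offsets: Seq[Offset]) extends Offset

object OffsetRecovery {
  // Re-pair recovered offsets with the live sources, which must be ordered
  // the same way as when the offsets were written out.
  def recoverProgress(sources: Seq[Source], saved: CompositeOffset): Map[Source, Offset] =
    sources.zip(saved.offsets).toMap
}
```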

@marmbrus (Owner Author):

While that is a reasonable solution, I'm not sure that's exactly how we want to do it. Here are the requirements as I see them:

  • User-created Offsets should be serializable.
  • A collection of offsets needs to be serializable in a way that lets us re-associate the offsets with their sources on deserialization.
  • There should be some class (probably StreamProgress) that acts as a container and keeps this opaque to Sinks; see the sketch after this list.
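A minimal sketch of a container along those lines, again building on the trait sketch at the top of the thread (the serialization hooks shown are assumptions, not the draft's actual mechanism):

```scala
package org.apache.spark.sql.execution.streaming

// Keeps sources in a deterministic order so offsets can be re-associated on
// deserialization, while Sinks only ever see the opaque container.
class StreamProgress(private val ordered: Seq[(Source, Offset)]) extends Serializable {
  // Sinks look up progress per source without seeing how it is stored.
  def get(source: Source): Option[Offset] =
    ordered.collectFirst { case (s, o) if s == source => o }

  // Only the offsets need to be written out.
  private[streaming] def toOffsetSeq: Seq[Offset] = ordered.map(_._2)
}

object StreamProgress {
  // Rebuild progress from recovered offsets plus the (identically ordered)
  // live sources.
  private[streaming] def fromOffsetSeq(sources: Seq[Source], offsets: Seq[Offset]): StreamProgress =
    new StreamProgress(sources.zip(offsets))
}
```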

@AmplabJenkins:

Merged build finished. Test PASSed.

@AmplabJenkins:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/13/
Test PASSed.

@AmplabJenkins:

Merged build finished. Test FAILed.

@AmplabJenkins:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/14/
Test FAILed.

Collaborator:

Probably better to call this getOffset to keep it consistent with the verb-noun format, especially since this is not expected to be a static return value.

@marmbrus (Owner Author):

We pretty much never use get, and this function should not have side effects. If anything, I think I'd change getNextBatch to fetchNextBatch to make it clear that it can do work, like caching, if needed.
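Side by side, the two conventions being weighed (a sketch only, reusing the Offset and DataFrame types from the description above):

```scala
trait Source {
  // No `get` prefix: reads as a side-effect-free accessor.
  def offset: Offset

  // `fetch` signals that the call may do real work (e.g. caching) before
  // returning the new data.
  def fetchNextBatch(start: Option[Offset]): DataFrame
}
```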

@AmplabJenkins:

Merged build finished. Test PASSed.

@AmplabJenkins:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/15/
Test PASSed.

Reviewer:

Since this is invalid, why not make currentProgress return Option[CompositeOffset]?

@marmbrus (Owner Author):

Yeah, I thought about that. I wasn't sure how much I wanted to expose to the user. It might actually be valid someday to save a different type of offset into a sink.
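The two signatures in question, as a sketch:

```scala
trait Sink {
  // Reviewer's suggestion: the narrower type makes invalid offsets
  // unrepresentable.
  def currentProgress: Option[CompositeOffset]

  // As drafted (shown as a comment, since both can't share one name):
  // leaves room for a sink to store some other Offset type someday.
  // def currentProgress: Option[Offset]
}
```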

@marmbrus (Owner Author) commented Jan 7, 2016

Going to merge and address further comments in a follow-up.

marmbrus added a commit referencing this pull request on Jan 7, 2016: "Initial draft of Streaming Dataframe infrastructure"
@marmbrus merged commit 0630d29 into streaming-df on Jan 7, 2016