
Conversation

@marmbrus (Owner) commented Jan 5, 2016

This PR adds the initial infrastructure for specifying streaming Sources, Sinks and executing queries as new data arrives. Additionally, it adds a test framework that can be used to test Sources and execution. The goal here is to get an initial version of the API committed to a shared branch so that we can continue to iterate on the various APIs.
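For orientation, here is a rough Scala sketch of the abstractions under discussion, reconstructed from the review comments below. The exact signatures in this draft may differ, and addBatch in particular is an assumption:

```scala
import org.apache.spark.sql.DataFrame

// Offsets are source-defined positions in a stream; see the serialization
// discussion below.
trait Offset extends Serializable

// A Source produces data as new records arrive.
trait Source {
  // The latest offset for which data is available (the getOffset naming
  // debate below refers to this method).
  def offset: Offset

  // The data that has arrived since `start` (None meaning "from the
  // beginning"); this method is mentioned by name in the review below.
  def getNextBatch(start: Option[Offset]): DataFrame
}

// A Sink consumes batches and records how far the stream has progressed.
trait Sink {
  // The offsets already committed to this sink, if any; the return type
  // is debated below.
  def currentProgress: Option[Offset]

  // Assumed signature: persist a batch along with the new progress.
  def addBatch(progress: Map[Source, Offset], data: DataFrame): Unit
}
```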

@marmbrus changed the title from "[WIP] Initial draft of streaming Dataframe infrastructure" to "Initial draft of streaming Dataframe infrastructure" on Jan 5, 2016
@marmbrus changed the title from "Initial draft of streaming Dataframe infrastructure" to "Initial draft of Streaming Dataframe infrastructure" on Jan 5, 2016
Reviewer:

nit: volatile and private[streaming]?

Reviewer:

Can the user reuse a Source? E.g., inputData1.toDS().union(inputData1.toDS())

@marmbrus (Owner Author):

Yes, that works already.

Collaborator:

Why are basic streaming abstractions like Offset in the sql.execution package? From a user's point of view, it's not intuitive to have "execution" in the middle, and it's painful to import that deep a package name.

I think basic abstractions like Offset, Source, Sink should be in sql.streaming, and things like StreamExecution can be in sql.execution.streaming or sql.streaming.execution.
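Sketched as imports, the difference the comment is pointing at (the current package name is inferred from the complaint above, so treat it as an assumption):

```scala
// What users would write under the proposed layout:
import org.apache.spark.sql.streaming.{Offset, Sink, Source}
import org.apache.spark.sql.execution.streaming.StreamExecution

// versus importing the basic abstractions from the execution package:
// import org.apache.spark.sql.execution.streaming.{Offset, Sink, Source}
```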

@marmbrus (Owner Author):

My thought was to put everything in execution by default, since it is hidden from the Scaladoc and so on. We can move things out as we decide to make them public API.

Collaborator:

I am cool with that for now.

@tdas (Collaborator) commented Jan 6, 2016

High-level question: should I care about class scopes and such right now?
At the least, I want to understand the package names.

Collaborator:

Therefore, it

@AmplabJenkins:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/10/
Test FAILed.

@JoshRosen (Collaborator):

Jenkins, retest this please.

@AmplabJenkins:

Merged build finished. Test FAILed.

@AmplabJenkins:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/11/
Test FAILed.

@JoshRosen (Collaborator):

Wait, I can fix this myself! I'll push a fix now.

@AmplabJenkins:

Merged build finished. Test PASSed.

@AmplabJenkins:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/12/
Test PASSed.

Collaborator:

But it's not just about serialization. StreamProgress is a map of Source -> Offset, and only the offsets should be serialized and deserialized, not the sources. At recovery time, if the sink has deserialized only the offsets, how will it reconstruct a StreamProgress (the map of Source -> Offset) from them?

There needs to be a separate class that is just a sequence of Offsets, which is what gets passed on to the Sink, as sketched below. The ordering of the offsets within the sequence would be deterministic, so that it is preserved after recovery; as long as the sources can also be ordered deterministically, the recovered offsets can be matched back to them.
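A sketch of that idea, reusing the traits sketched in the PR description above (the name CompositeOffset is borrowed from later in this thread, and the re-pairing helper is illustrative only):

```scala
// An ordered sequence of offsets with no reference to the sources
// themselves; only this gets serialized.
case class CompositeOffset(offsets: Seq[Offset]) extends Offset

object OffsetRecovery {
  // Re-pair recovered offsets with the live sources, which must be ordered
  // the same way as when the offsets were written out.
  def recoverProgress(sources: Seq[Source], saved: CompositeOffset): Map[Source, Offset] =
    sources.zip(saved.offsets).toMap
}
```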

@marmbrus (Owner Author):

While that is a reasonable solution, I'm not sure that's exactly how we want to do it. Here are the requirements as I see them:

  • User-created Offsets should be serializable.
  • A collection of offsets needs to be serializable in a way that lets us re-associate the offsets with their sources on deserialization.
  • There should be some class (probably StreamProgress) that acts as a container and keeps this opaque to Sinks; see the sketch after this list.
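A minimal sketch of a container along those lines, again building on the trait sketch at the top of the thread (the serialization hooks shown are assumptions, not the draft's actual mechanism):

```scala
package org.apache.spark.sql.execution.streaming

// Keeps sources in a deterministic order so offsets can be re-associated on
// deserialization, while Sinks only ever see the opaque container.
class StreamProgress(private val ordered: Seq[(Source, Offset)]) extends Serializable {
  // Sinks look up progress per source without seeing how it is stored.
  def get(source: Source): Option[Offset] =
    ordered.collectFirst { case (s, o) if s == source => o }

  // Only the offsets need to be written out.
  private[streaming] def toOffsetSeq: Seq[Offset] = ordered.map(_._2)
}

object StreamProgress {
  // Rebuild progress from recovered offsets plus the (identically ordered)
  // live sources.
  private[streaming] def fromOffsetSeq(sources: Seq[Source], offsets: Seq[Offset]): StreamProgress =
    new StreamProgress(sources.zip(offsets))
}
```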

@AmplabJenkins:

Merged build finished. Test PASSed.

@AmplabJenkins:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/13/
Test PASSed.

@AmplabJenkins:

Merged build finished. Test FAILed.

@AmplabJenkins:

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/14/
Test FAILed.

Collaborator:

Probably better to call this getOffset to keep it consistent with the verb-noun format, especially since this is not expected to be a static return value.

@marmbrus (Owner Author):

We pretty much never use get, and this function should not have side effects. If anything, I think I'd change getNextBatch to fetchNextBatch to make it clear that it can do work, like caching, if needed.
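Side by side, the two conventions being weighed (a sketch only, reusing the Offset and DataFrame types from the description above):

```scala
trait Source {
  // No `get` prefix: reads as a side-effect-free accessor.
  def offset: Offset

  // `fetch` signals that the call may do real work (e.g. caching) before
  // returning the new data.
  def fetchNextBatch(start: Option[Offset]): DataFrame
}
```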

@AmplabJenkins:

Merged build finished. Test PASSed.

@AmplabJenkins:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/15/
Test PASSed.

Reviewer:

Since this is invalid, why not make currentProgress return Option[CompositeOffset]?

@marmbrus (Owner Author):

Yeah, I thought about that. I wasn't sure how much I wanted to expose to the user. It might actually be valid someday to save a different type of offset into a sink.
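The two signatures in question, as a sketch:

```scala
trait Sink {
  // Reviewer's suggestion: the narrower type makes invalid offsets
  // unrepresentable.
  def currentProgress: Option[CompositeOffset]

  // As drafted (shown as a comment, since both can't share one name):
  // leaves room for a sink to store some other Offset type someday.
  // def currentProgress: Option[Offset]
}
```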

@marmbrus (Owner Author) commented Jan 7, 2016

Going to merge and address further comments in a follow-up.

marmbrus added a commit referencing this pull request on Jan 7, 2016: "Initial draft of Streaming Dataframe infrastructure"
@marmbrus merged commit 0630d29 into streaming-df on Jan 7, 2016