[WIP][SPARK-28191][SS] New data source - state - reader part #24990
Conversation
As I've mentioned in SPARK-27237, this patch could be simplified once we adopt #24173.
Test build #106978 has finished for PR 24990 at commit

Test build #106979 has finished for PR 24990 at commit

Retest this please.

Test build #107196 has finished for PR 24990 at commit
Note: there's an ask for an SPIP on the umbrella issue (New data source - state), so it would take time to go through the SPIP process. I'll keep this PR open so the proposed change is easy to review.
retest this, please

Test build #110705 has finished for PR 24990 at commit
Force-pushed from a40110d to 64a08b9
Test build #110710 has finished for PR 24990 at commit
Regarding the document generation failure, I've reported it to the dev@ mailing list. Let me trigger the build again to see whether it's flaky (though the chance of that seems high) or consistent.
retest this, please

Test build #110716 has finished for PR 24990 at commit

retest this, please

Test build #112704 has finished for PR 24990 at commit

retest this, please

Test build #116588 has finished for PR 24990 at commit
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.datasources.v2.state.StateDataSourceV2
I came from the PR you pointed out. Why is it "state"? Can a batch query use this source?
"state" is one of the terms of structured streaming (not actually tied to Structured Streaming itself, but common to recent streaming technology). State is created and used by structured streaming queries, but there are cases where we want to modify the state outside of the streaming query, like changing the schema, repartitioning, etc. This data source allows a batch query to do that. (By intention, the data source is not even designed to be used from a streaming query.)
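To make the idea concrete, here is a hedged sketch of what reading state from a batch query might look like. The option names ("checkpointLocation", "operatorId", "batchId") are assumptions for illustration only, not the actual API of this patch:

```scala
// Hypothetical sketch: a batch query reading streaming state from a
// checkpoint via the proposed "state" data source. Option names below
// are illustrative assumptions, not the patch's final option set.
val stateDf = spark.read
  .format("state")
  .option("checkpointLocation", "/tmp/query-checkpoint")
  .option("operatorId", "0")   // which stateful operator's state to read
  .option("batchId", "42")     // which committed batch's state to read
  .load()

// Once loaded, the state rows behave like any batch DataFrame, so they
// can be inspected, repartitioned, or transformed before being written back.
stateDf.show()
```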
I see. So, it's designed for batch queries against the state generated by structured streaming.
@HeartSaVioR, could I ask to post a working example in PR or JIRA description? I think one working example will clarify what this source/PR targets.
I would do that once I get an actual reviewer who is willing to shepherd this issue - the only request I've gotten for this feature so far was asking for an SPIP.
https://github.com/HeartSaVioR/spark-state-tools
The repository above contains the entire functionality (though it's tied to Spark 2.4, and the usage is somewhat awkward because Spark doesn't provide schema information) along with an explanation.
I was thinking that showing an example could clarify the importance of this source and hopefully attract more review and attention. But okay, we can wait for the review first, too.
retest this, please

Test build #125732 has finished for PR 24990 at commit

Test build #125740 has finished for PR 24990 at commit

retest this, please

Test build #128680 has finished for PR 24990 at commit
Force-pushed from 0730c2d to 7f2b74e
Test build #132567 has finished for PR 24990 at commit
Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132603 has finished for PR 24990 at commit
Force-pushed from a6743dd to 99b00db
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132745 has finished for PR 24990 at commit

retest this, please

Kubernetes integration test starting

Kubernetes integration test status success

Test build #132916 has finished for PR 24990 at commit

This will likely be revised, so I'll re-submit the PR and update the description.
What changes were proposed in this pull request?
Please refer to SPARK-28190 for the rationale behind introducing a new state data source.
This patch proposes a new data source, "state", for streaming queries, enabling users' batch queries to read the state stored in a checkpoint. The new data source is located in the sql-core module - I didn't create a new module under external since state is not an external storage.
Given that state itself has no schema information (SPARK-27237 is addressing this), this patch includes a tool (StateSchemaExtractor) to extract the schema of state from the streaming query. Ideally we would adopt SPARK-27237 and get rid of this tool.
The state data source leverages the existing state store APIs, so it should be compatible with any state store provider. That said, while the data source is a generic one, it could target a specific state store provider to gain optimal performance (on demand).
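To illustrate why the extractor is needed: state is stored as untyped key/value rows, so the schema must be recovered from the query before the data source can expose typed columns. The sketch below is a hedged illustration only; apart from the class name StateSchemaExtractor, the method and field names ("extract", "keySchema", "valueSchema") are assumptions and may differ from the actual API in this patch:

```scala
// Hypothetical usage sketch of the schema-extraction step. All names
// except StateSchemaExtractor are illustrative assumptions, not the
// patch's actual API.
val schemaInfo = StateSchemaExtractor.extract(
  spark,
  checkpointLocation = "/tmp/query-checkpoint",
  operatorId = 0)

// State is a key/value store: the key side comes from the grouping
// columns, the value side from the aggregation buffer. Both schemas
// are needed to expose state rows as typed columns in a batch query.
println(schemaInfo.keySchema.treeString)
println(schemaInfo.valueSchema.treeString)
```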
How was this patch tested?
New UTs.