[WIP][SPARK-28191][SS] New data source - state - reader part #24990
Conversation
As I've mentioned in SPARK-27237, this patch could be simplified once we adopt #24173.
Test build #106978 has finished for PR 24990 at commit

Test build #106979 has finished for PR 24990 at commit

Retest this please.

Test build #107196 has finished for PR 24990 at commit
Note: there's an ask for an SPIP on the umbrella issue (New data source - state), so it would take time to go through the SPIP process. I'll keep this PR open so the proposed change is easy to review.
retest this, please

Test build #110705 has finished for PR 24990 at commit
Force-pushed from a40110d to 64a08b9
Test build #110710 has finished for PR 24990 at commit
Regarding the document generation failure, I've reported it to the dev@ mailing list. Let me trigger the build again to see whether it's flaky (though the chance of that seems high) or consistent.
retest this, please

Test build #110716 has finished for PR 24990 at commit

retest this, please

Test build #112704 has finished for PR 24990 at commit

retest this, please

Test build #116588 has finished for PR 24990 at commit
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.datasources.v2.state.StateDataSourceV2
I came from the PR you pointed out. Why is it "state"? Can a batch query use this source?
"state" is one of the terms of structured streaming (not actually tied to Structured Streaming itself, but common to recent streaming technology). State is created and used by structured streaming queries, but there are cases where we want to modify the state outside of the streaming query, like changing the schema, repartitioning, etc. This data source allows a batch query to do that. (By intention, the data source is not even designed to be used from a streaming query.)
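To make the idea concrete, here is a hedged sketch of what reading state from a batch query might look like. The option names ("checkpointLocation", "operatorId", "batchId") are assumptions for illustration only, not the actual API of this patch:

```scala
// Hypothetical sketch: a batch query reading streaming state from a
// checkpoint via the proposed "state" data source. Option names below
// are illustrative assumptions, not the patch's final option set.
val stateDf = spark.read
  .format("state")
  .option("checkpointLocation", "/tmp/query-checkpoint")
  .option("operatorId", "0")   // which stateful operator's state to read
  .option("batchId", "42")     // which committed batch's state to read
  .load()

// Once loaded, the state rows behave like any batch DataFrame, so they
// can be inspected, repartitioned, or transformed before being written back.
stateDf.show()
```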
I see. So, it's designed for batch queries against the state generated by structured streaming.
@HeartSaVioR, could I ask to post a working example in PR or JIRA description? I think one working example will clarify what this source/PR targets.
I would do that once I get an actual reviewer who is willing to shepherd this issue - the only request I've gotten for this feature so far was asking for an SPIP.
https://github.com/HeartSaVioR/spark-state-tools
The repository above contains the entire functionality (though it's tied to Spark 2.4, and the usage is somewhat awkward because Spark doesn't provide schema information) along with an explanation.
I was thinking that showing an example could clarify the importance of this source and hopefully attract more review and attention. But okay, we can wait for the review first, too.
retest this, please

Test build #125732 has finished for PR 24990 at commit

Test build #125740 has finished for PR 24990 at commit

retest this, please

Test build #128680 has finished for PR 24990 at commit
Force-pushed from 0730c2d to 7f2b74e
Test build #132567 has finished for PR 24990 at commit
Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132603 has finished for PR 24990 at commit
Force-pushed from a6743dd to 99b00db
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #132745 has finished for PR 24990 at commit

retest this, please

Kubernetes integration test starting

Kubernetes integration test status success

Test build #132916 has finished for PR 24990 at commit

This will likely be revised, so I'll re-submit the PR and update the description.
What changes were proposed in this pull request?
Please refer to SPARK-28190 for the rationale behind introducing a new state data source.
This patch proposes a new data source, "state", for streaming queries, enabling users' batch queries to read the state stored in a checkpoint. The new data source is located in the sql-core module - I didn't create a new module under external since state is not an external storage.
Given that state itself has no schema information (SPARK-27237 is addressing this), this patch includes a tool (StateSchemaExtractor) to extract the schema of state from the streaming query. Ideally we would adopt SPARK-27237 and get rid of this tool.
The state data source leverages the existing state store APIs, so it should be compatible with any state store provider. That said, while the data source is a generic one, it could target a specific state store provider to gain optimal performance (on demand).
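To illustrate why the extractor is needed: state is stored as untyped key/value rows, so the schema must be recovered from the query before the data source can expose typed columns. The sketch below is a hedged illustration only; apart from the class name StateSchemaExtractor, the method and field names ("extract", "keySchema", "valueSchema") are assumptions and may differ from the actual API in this patch:

```scala
// Hypothetical usage sketch of the schema-extraction step. All names
// except StateSchemaExtractor are illustrative assumptions, not the
// patch's actual API.
val schemaInfo = StateSchemaExtractor.extract(
  spark,
  checkpointLocation = "/tmp/query-checkpoint",
  operatorId = 0)

// State is a key/value store: the key side comes from the grouping
// columns, the value side from the aggregation buffer. Both schemas
// are needed to expose state rows as typed columns in a batch query.
println(schemaInfo.keySchema.treeString)
println(schemaInfo.valueSchema.treeString)
```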
How was this patch tested?
New UTs.