@HeartSaVioR
Contributor

What changes were proposed in this pull request?

Please refer to SPARK-28190 for the rationale behind introducing a new state data source.

This patch proposes a new data source, "state", for streaming queries, which enables users' batch queries to read the state stored in a checkpoint. The new data source lives in the sql-core module; I didn't create a new module under external, since state is not external storage.

Given that state itself carries no schema information (SPARK-27237 is addressing this), this patch includes a tool (StateSchemaExtractor) to extract the schema of the state from a streaming query. Ideally we would adopt SPARK-27237 and get rid of this tool.

The state data source leverages the existing state store APIs, so it should be compatible with any state store provider. In other words, the data source is generic, but it could be specialized for a specific state store provider to gain optimal performance if there is demand.

How was this patch tested?

New UTs.

@HeartSaVioR
Contributor Author

As I mentioned regarding SPARK-27237, this patch could be simplified once we adopt #24173.

@SparkQA

SparkQA commented Jun 27, 2019

Test build #106978 has finished for PR 24990 at commit 980a46f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class StateDataSourceV2 extends TableProvider with DataSourceRegister
      • class StatePartitionReader(
      • class StatePartitionReaderFactory(
      • class StateScanBuilder(
      • class StateStoreInputPartition(
      • class StateScan(
      • class StateSchemaExtractor(spark: SparkSession) extends Logging
      • case class StateSchemaInfo(
      • class StateTable(

@SparkQA

SparkQA commented Jun 28, 2019

Test build #106979 has finished for PR 24990 at commit a40110d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jul 4, 2019

Test build #107196 has finished for PR 24990 at commit a40110d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Note: there's a request for an SPIP on the umbrella issue (New data source - state), so it will take time to go through the SPIP process. I'll keep this PR open so the proposed change is easy to see.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110705 has finished for PR 24990 at commit a40110d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110710 has finished for PR 24990 at commit 64a08b9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Regarding the documentation generation failure, I've reported it to the dev@ mailing list. Let me trigger the build again to see whether it's flaky (though the failure rate seems high in that case) or consistent.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110716 has finished for PR 24990 at commit 64a08b9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Oct 26, 2019

Test build #112704 has finished for PR 24990 at commit 64a08b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116588 has finished for PR 24990 at commit 64a08b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.datasources.v2.state.StateDataSourceV2
Member

@HyukjinKwon HyukjinKwon Apr 9, 2020

I came here from the PR you pointed out. Why is it called "state"? Can a batch query use this source?

Contributor Author

"state" is the one of the "terms" of "structured streaming" (not actually tied to structured streaming but tied to recent streaming technology). It's being created and used from structured streaming, but there're some cases we want to modify the state "outside" of the streaming query, like changing schema, repartitioning, etc. This data source will allow "batch query" to do it. (So the data source is not even designed to use from streaming query by intention.)

Member

I see. So it's designed for batch queries over the state generated by Structured Streaming.
@HeartSaVioR, could I ask you to post a working example in the PR or JIRA description? I think one working example would clarify what this source/PR targets.

Contributor Author

@HeartSaVioR HeartSaVioR Apr 10, 2020

I'll do that once I get an actual reviewer who is willing to shepherd this issue; the only request I've received for this feature so far was for an SPIP.

https://github.com/HeartSaVioR/spark-state-tools

The repository above contains the full functionality (though it's tied to Spark 2.4 and has some awkward usage, because Spark doesn't provide schema information) along with an explanation.

Member

I was thinking that showing an example could clarify the importance of this source, and hopefully attract more review and attention. But okay, we can wait for the review first.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jul 12, 2020

Test build #125732 has finished for PR 24990 at commit 64a08b9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 13, 2020

Test build #125740 has finished for PR 24990 at commit 0730c2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Sep 15, 2020

Test build #128680 has finished for PR 24990 at commit 0730c2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2020

Test build #132567 has finished for PR 24990 at commit 7f2b74e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class StateDataSourceV2 extends TableProvider with DataSourceRegister
      • class StatePartitionReader(
      • class StatePartitionReaderFactory(
      • class StateScanBuilder(
      • class StateStoreInputPartition(
      • class StateScan(
      • class StateSchemaExtractor(spark: SparkSession) extends Logging
      • case class StateSchemaInfo(
      • class StateTable(

@SparkQA

SparkQA commented Dec 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37171/

@SparkQA

SparkQA commented Dec 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37171/

@SparkQA

SparkQA commented Dec 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37207/

@SparkQA

SparkQA commented Dec 11, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37207/

@SparkQA

SparkQA commented Dec 11, 2020

Test build #132603 has finished for PR 24990 at commit a6743dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR HeartSaVioR changed the title from "[SPARK-28191][SS] New data source - state - reader part" to "[WIP][SPARK-28191][SS] New data source - state - reader part" Dec 14, 2020
@HeartSaVioR HeartSaVioR marked this pull request as draft December 14, 2020 04:14
@SparkQA

SparkQA commented Dec 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37345/

@SparkQA

SparkQA commented Dec 14, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37345/

@SparkQA

SparkQA commented Dec 14, 2020

Test build #132745 has finished for PR 24990 at commit a495f6d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37519/

@SparkQA

SparkQA commented Dec 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37519/

@SparkQA

SparkQA commented Dec 17, 2020

Test build #132916 has finished for PR 24990 at commit a495f6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

This is likely to be revised, so I'll re-submit the PR and update the description.
