[SPARK-25245][DOCS][SS] Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming #22238
Conversation
Test build #95268 has finished for PR 22238 at commit
> # Additional Information
>
> **Gotchas**
hmmmmm .. @HeartSaVioR how about leaving them in the code or API docs somewhere as a note?
IMO, it would be better to keep it here as well as in the code; we may not be able to surface it in the right API docs, and there is a chance for users to miss it.
@HeartSaVioR, maybe add an example here to illustrate how to use coalesce?
I was going to add the explanation to doc() of spark.sql.shuffle.partitions, but it looks like what we explain in doc() would not be published automatically. (Please correct me if I'm missing something.) SQLConf is not even exposed to scaladoc. That's why I'm adding this to the structured streaming guide doc. Actually, I think most end users only look at this doc for structured streaming, and we can't (and shouldn't) expect end users to read the source code to find it.
I also hadn't noticed that spark.sql.shuffle.partitions is explained in sql-programming-guide.md, but I still think we need to explain here all configs that work differently with a batch query, and spark.sql.shuffle.partitions is such a case.
Btw, "Gotchas" sounds a bit funny. Maybe having a section would be better, like ## Configuration Options For Structured Streaming in sql-programming-guide.md?
> - For structured streaming, modifying "spark.sql.shuffle.partitions" is restricted once you run the query.
> - This is because state is partitioned via key, hence number of partitions for state should be unchanged.
> - If you want to run less tasks for stateful operations, `coalesce` would help with avoiding unnecessary repartitioning. Please note that it will also affect downstream operators.
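The key-based partitioning the quoted text refers to can be sketched in plain Python. This is an illustration only: Spark actually hashes keys with Murmur3, and `hash_partition` below is a hypothetical stand-in helper.

```python
import zlib

def hash_partition(key: str, num_partitions: int) -> int:
    # Deterministically map a key to a state partition, as a stand-in
    # for Spark's Murmur3-based hash partitioning of state keys.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(1000)]
before = {k: hash_partition(k, 200) for k in keys}  # 200 shuffle partitions
after = {k: hash_partition(k, 100) for k in keys}   # restarted with 100

# Keys whose partition assignment changed: a restarted query with a
# different partition count would look for their state in the wrong place.
moved = [k for k in keys if before[k] != after[k]]
```

This is why the partition count is baked into the checkpointed state and must not change across restarts.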
An example of how to use the coalesce operator with a stateful streaming query would be superb.
I'd also appreciate it if you added what type of downstream operators are affected, and how.
It just means that the number of partitions in a stateful operation's output will be the same as the parameter of coalesce, and that number of partitions will be kept unless another shuffle happens. Without coalesce it is implicitly the same as spark.sql.shuffle.partitions, whose default value is 200.
I'll add the code, but I'm not sure we need the code per language (Scala / Java / Python tabs) since they would be the same.
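A plain-Python sketch of the idea discussed above (hypothetical helper, not Spark's implementation): `coalesce` merges existing partitions into fewer buckets instead of redistributing rows by key, so no shuffle is needed.

```python
def coalesce(partitions, num_output):
    # Merge existing partitions into fewer buckets by concatenation,
    # without recomputing any key-to-partition assignment (no shuffle).
    merged = [[] for _ in range(num_output)]
    for i, part in enumerate(partitions):
        merged[i % num_output].extend(part)
    return merged

# 200 input partitions (the spark.sql.shuffle.partitions default),
# reduced to 10 output partitions.
parts = [[f"row-{i}"] for i in range(200)]
small = coalesce(parts, 10)
```

Every row survives and no key changes its grouping; the operation only reduces how many tasks run, which is also why downstream operators are affected until the next shuffle.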
Also adding @tdas @zsxwing @jose-torres to cc.
> This section is for configurations which are only available for structured streaming, or they behave differently with batch query.
>
> - spark.sql.shuffle.partitions
We do have it in sql-programming-guide.md. Shall we add some info there for now?
IMHO, if something goes wrong with structured streaming, end users would review the structured streaming guide doc rather than the SQL programming guide doc. Could we wait to hear more voices on this?
What I am worried about is adding a new section, which is quite unusual. Usually we go for it when multiple instances are detected later.
Are there more instances to describe here specifically for structured streaming?
I can revert adding a new section if you meant adding ## on it. Since "Gotchas" sounds a bit funny, I will change it to **Notes**.
The rationale for adding this to the doc is that this restriction has caused a lot of wondering on SO and the user mailing list, and even a patch trying to "fix" it. So it would be nice for all end users of structured streaming to see it at least once, even if they only skim the doc, so that they remember it and can revisit the doc once they get stuck on this.
Test build #95321 has finished for PR 22238 at commit
> # Additional Information
>
> **Notes**
I was thinking of adding this information only somewhere in the API or configuration docs. For instance, notes like #19617.

> lots of wondering around SO and user mailing list,

I don't object to noting this stuff, but usually the site has only the key points for features or configurations.
If there are more instances to describe specifically for structured streaming (where the same SQL configurations could lead to confusion), I am fine with adding this. If not, or if we are less sure for now, I would add them to the API's doc or the configuration's doc.
> - If you want to run less tasks for stateful operations, `coalesce` would help with avoiding unnecessary repartitioning.
> - e.g. `df.groupBy("time").count().coalesce(10)` reduces the number of tasks by 10, whereas `spark.sql.shuffle.partitions` may be bigger.
> - After `coalesce`, the number of (reduced) tasks will be kept unless another shuffle happens.
> - `spark.sql.streaming.stateStore.providerClass`
Ah, okay, so there are more instances to describe here. If so, I'm okay.
Test build #95326 has finished for PR 22238 at commit
> val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
> - .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
> + .doc("The default number of partitions to use when shuffling data for joins or aggregations. " +
> + "Note: For structured streaming, this configuration cannot be changed between query " +
s/cannot be/must not be/
The sentence is borrowed from the existing one in spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (lines 978 to 979 in 8198ea5):

> "Note: This configuration cannot be changed between query restarts from the same " +
> "checkpoint location.")
> "The class used to manage state data in stateful streaming queries. This class must " +
> - "be a subclass of StateStoreProvider, and must have a zero-arg constructor.")
> + "be a subclass of StateStoreProvider, and must have a zero-arg constructor. " +
> + "Note: For structured streaming, this configuration cannot be changed between query " +
s/cannot be/must not be/
Same here.
Could we revisit this? This is a simple doc change that helps end users troubleshoot something that could be confusing without knowing the details.
srowen left a comment
Not as sure about the docs additions, but the additional info in param descriptions seems handy.
> **Notes**
>
> - There're couple of configurations which are not modifiable once you run the query. If you really want to make changes for these configurations, you have to discard checkpoint and start a new query.
There're -> "There are a". This should probably be reworded though; doesn't need the 2nd person. "Several configurations are not modifiable after the query has run. To change them, discard the checkpoint and start a new query. These configurations include:"
I'm not sure how much we need to explain why they're not modifiable, but suggesting alternatives is good. I think the example is redundant below.
The reworded expression looks shorter and better. Will address.

> I'm not sure how much we need to explain why they're not modifiable

Yeah, I agree everyone has a different view on this. Someone could think the page is a guide, so it doesn't need to include implementation details. Others (including me) could think the information is described nowhere else, so it needs to be included here, especially since this is the major difference between a batch query and a streaming query.
No strong opinion (verbosity vs. less wondering for end users), so I can follow the decision. If you think this is redundant, please let me know so that I can address it.

> I think the example is redundant below.

Will address.
> - e.g. `df.groupBy("time").count().coalesce(10)` reduces the number of tasks by 10, whereas `spark.sql.shuffle.partitions` may be bigger.
> - After `coalesce`, the number of (reduced) tasks will be kept unless another shuffle happens.
> - `spark.sql.streaming.stateStore.providerClass`
> - To read previous state of the query properly, the class of state store provider should be unchanged.
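The zero-arg-constructor requirement on the provider class exists because the provider is instantiated reflectively from the configured class name. A minimal Python sketch of that pattern (illustrative only; `load_provider` is a hypothetical helper, and Spark does this with JVM reflection in Scala):

```python
import importlib

def load_provider(class_path: str):
    # Resolve "pkg.module.ClassName" to a class object and instantiate
    # it with a zero-arg constructor, mirroring how a configured
    # provider class would be loaded by name.
    module_name, _, class_name = class_path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()

# Any importable class with a zero-arg constructor works here:
provider = load_provider("collections.OrderedDict")
```

If a checkpoint was written by one provider class and the query restarts configured with another, the new provider cannot be assumed to understand the old on-disk state format, hence the restriction.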
To read the previous state, etc. These also don't need to be sub bullet points?
Will concatenate the two lines via `:` and add "the".
> - There're couple of configurations which are not modifiable once you run the query. If you really want to make changes for these configurations, you have to discard checkpoint and start a new query.
> - `spark.sql.shuffle.partitions`
> - This is due to the physical partitioning of state: state is partitioned via applying hash function to key, hence the number of partitions for state should be unchanged.
> - If you want to run less tasks for stateful operations, `coalesce` would help with avoiding unnecessary repartitioning.
less -> fewer
Nice finding. Will address.
Test build #100212 has finished for PR 22238 at commit
> - Several configurations are not modifiable after the query has run. To change them, discard the checkpoint and start a new query. These configurations include:
> - `spark.sql.shuffle.partitions`
> - This is due to the physical partitioning of state: state is partitioned via applying hash function to key, hence the number of partitions for state should be unchanged.
Not sure we need sub- and sub-sub-lists here, but I don't mind much. Do end users need to know why? Those details are more often relevant in source code docs.
If I'm one of the end users (and I am), I'd wonder why I'm restricted from changing these parameters. There are fairly many developers who don't blindly accept anything and want to understand. I'm unsure whether these details belong in a guide doc, but we also don't have docs explaining such details, or an FAQ/troubleshooting page, so I'm not sure where else to put them.
If we are OK with adding these details to the doc of the config in source code (like below), I'm OK with that. Otherwise I'm not sure users can search for and find the description in source code.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (lines 265 to 268 in 114d0de):

> val SHUFFLE_PARTITIONS = buildConf("spark.sql.shuffle.partitions")
>   .doc("The default number of partitions to use when shuffling data for joins or aggregations.")
>   .intConf
>   .createWithDefault(200)
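The `buildConf` chain quoted above is a fluent builder pattern. A toy Python equivalent (hypothetical names, for illustration only; not Spark's actual API):

```python
class ConfBuilder:
    """Toy stand-in for SQLConf's buildConf builder chain."""

    def __init__(self, key):
        self.key = key
        self._doc = ""

    def doc(self, text):
        self._doc = text
        return self  # return self so calls can be chained

    def int_conf(self):
        return self  # in Spark this step fixes the value type

    def create_with_default(self, default):
        # Terminal step: materialize the config entry.
        return {"key": self.key, "doc": self._doc, "default": default}

shuffle_partitions = (
    ConfBuilder("spark.sql.shuffle.partitions")
    .doc("The default number of partitions to use when shuffling data.")
    .int_conf()
    .create_with_default(200)
)
```

Extending the `.doc(...)` string, as this PR does, is therefore a purely additive change to the materialized config entry.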
(Chiming in to say that I've definitely seen end users get confused and want to know why this restriction is so.)
Sounds good to me, leave it in
Merged to master
Thanks all for reviewing and merging!
[SPARK-25245][DOCS][SS] Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming

Closes apache#22238 from HeartSaVioR/SPARK-25245.
Lead-authored-by: Jungtaek Lim <[email protected]>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
This patch adds an explanation of why "spark.sql.shuffle.partitions" must stay unchanged in structured streaming, which a couple of users have already wondered about, and some even thought was a bug. It helps other end users learn about this behavior before they discover it themselves and are left wondering.
How was this patch tested?
No tests needed, because this is a simple addition to the guide doc, checked with a markdown editor.