Pull request review (closed) — changes from 1 commit
6 changes: 6 additions & 0 deletions docs/structured-streaming-programming-guide.md
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f

# Additional Information

**Gotchas**
**Member** commented:
Hmmmm... @HeartSaVioR, how about leaving these in the code or the API docs somewhere as a note?

**@arunmahadevan** (Contributor) commented on Aug 27, 2018:

IMO, it would be better to keep it here as well as in the code; we may not be able to surface it in the right API docs, and there's a chance users would miss it.

@HeartSaVioR, maybe add an example here to illustrate how to use coalesce?

**@HeartSaVioR** (Contributor, Author) commented on Aug 27, 2018:

I was going to add the explanation to the doc() of spark.sql.shuffle.partitions, but it looks like what we write in doc() would not be published automatically. (Please correct me if I'm missing something.) SQLConf is not even exposed in the scaladoc. That's why I'm adding this to the Structured Streaming guide. Most end users only look at this doc for Structured Streaming, and we can't (and shouldn't) expect them to dig through the source code to find it.

Also, I hadn't noticed that spark.sql.shuffle.partitions is already explained in sql-programming-guide.md, but I think we should explain here any configs that behave differently for streaming queries than for batch queries, and spark.sql.shuffle.partitions is such a case.

Btw, "Gotchas" sounds a bit funny though. Maybe a dedicated section would be better, like ## Configuration Options For Structured Streaming in sql-programming-guide.md?


- For structured streaming, modifying `spark.sql.shuffle.partitions` is restricted once you run the query.
- This is because the state is partitioned by key, so the number of partitions for state must remain unchanged.
- If you want to run fewer tasks for stateful operations, `coalesce` helps avoid unnecessary repartitioning. Please note that it will also affect downstream operators.
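The "partitioned via key" point can be illustrated with a small sketch (plain Python, not Spark code; the modulo scheme below is a simplified stand-in for Spark's actual hash partitioner): a key's state can only be found again on a restart if the partition count stays the same.

```python
def partition_for(key, num_partitions):
    # Simplified stand-in for Spark's hash partitioning of state rows:
    # a row's state lives in the partition its key hashes to.
    return hash(key) % num_partitions

# Write state for some keys under 200 partitions (the default
# value of spark.sql.shuffle.partitions).
keys = ["user-1", "user-2", "user-3"]
written = {k: partition_for(k, 200) for k in keys}

# Reading back with the same partition count finds every key's state...
assert all(partition_for(k, 200) == p for k, p in written.items())

# ...but with a different count, lookups may go to the wrong partition,
# which is why the setting is fixed once the query has run.
moved = [k for k, p in written.items() if partition_for(k, 150) != p]
print(f"{len(moved)} of {len(keys)} keys would land in a different partition")
```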
**Contributor** commented:

An example of how to use the coalesce operator with a stateful streaming query would be superb.

I'd also appreciate it if you added what types of downstream operators are affected and how.

**@HeartSaVioR** (Contributor, Author) commented on Aug 27, 2018:

It just means that the number of partitions in the stateful operation's output will be the same as the parameter passed to coalesce, and that number of partitions is kept until another shuffle happens. Without coalesce it is implicitly spark.sql.shuffle.partitions, whose default value is 200.

I'll add the code, but I'm not sure we need the code per language in Scala / Java / Python tabs, since they would be the same.
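A sketch of the requested example (PySpark; the socket source, host/port, and coalesce parameter are illustrative choices, not taken from this PR, and running it requires a Spark installation plus a listening socket server):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("CoalesceStatefulExample").getOrCreate()

# Illustrative streaming source; any streaming DataFrame would do.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))

# Per the discussion above: after coalesce, the stateful aggregation's
# output has as many partitions as the coalesce parameter, and that count
# is kept downstream until another shuffle happens. Without coalesce it is
# implicitly spark.sql.shuffle.partitions (default 200).
counts = words.coalesce(1).groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```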


**Further Reading**

- See and run the