
Conversation

@10110346
Contributor

10110346 commented Dec 5, 2018

What changes were proposed in this pull request?

The doc comment describes two of the serialized shuffle conditions inaccurately:

`1. The shuffle dependency specifies no aggregation or output ordering.`
If the shuffle dependency specifies aggregation but aggregates only on the reduce side, serialized shuffle can still be used.
`3. The shuffle produces fewer than 16777216 output partitions.`
If the number of output partitions is exactly 16777216, serialized shuffle can still be used.

See the method `canUseSerializedShuffle`.

How was this patch tested?

N/A
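
For readers following the description above, here is a minimal Scala sketch of the eligibility check as this PR describes it. It is not Spark's code: `ShuffleDepSketch`, `SerializedShuffleSketch`, and `canUseSerializedShuffleSketch` are hypothetical names, and the `serializerSupportsRelocation` field is an assumption standing in for the condition that this PR does not change. Only the map-side combine condition and the inclusive 16777216 bound come from the discussion.

```scala
// Hypothetical stand-in for the few properties of a shuffle dependency
// that matter for this check. Not Spark's ShuffleDependency API.
final case class ShuffleDepSketch(
    serializerSupportsRelocation: Boolean, // assumed condition, untouched by this PR
    mapSideCombine: Boolean,               // true only when aggregation runs on the map side
    numPartitions: Int)

object SerializedShuffleSketch {
  // 16777216 = 2^24, the partition count quoted in the PR description.
  val MaxPartitionsForSerializedMode: Int = 1 << 24

  // Reduce-side-only aggregation leaves mapSideCombine false, so it still qualifies.
  // The bound is inclusive: exactly 16777216 partitions is allowed.
  def canUseSerializedShuffleSketch(dep: ShuffleDepSketch): Boolean =
    dep.serializerSupportsRelocation &&
      !dep.mapSideCombine &&
      dep.numPartitions <= MaxPartitionsForSerializedMode
}
```

Under these assumptions, a dependency with exactly 16777216 partitions and no map-side combine qualifies, while one with 16777217 partitions does not.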

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99703 has finished for PR 23228 at commit d5dadbf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2018

Test build #4453 has finished for PR 23228 at commit d5dadbf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen left a comment

I believe the test failure can be ignored as it can't be related.

  *
  * - Serialized sorting: used when all three of the following conditions hold:
- * 1. The shuffle dependency specifies no aggregation or output ordering.
+ * 1. The shuffle dependency specifies no map-side combine.
Member

Does this sound right, @JoshRosen?

Contributor

looks right to me, according to

@10110346
Contributor Author

10110346 commented Dec 7, 2018

cc @JoshRosen @cloud-fan

@cloud-fan
Contributor

LGTM, cc @jiangxb1987

@jiangxb1987
Contributor

Please update the title to: [MINOR][DOC] Update the condition description of serialized shuffle

10110346 changed the title from "[MINOR][DOC]The condition description of serialized shuffle is not very accurate" to "[MINOR][DOC] Update the condition description of serialized shuffle" on Dec 10, 2018
@10110346
Contributor Author

I have updated it, thanks all.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99892 has finished for PR 23228 at commit d5dadbf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@10110346
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99902 has finished for PR 23228 at commit d5dadbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

asfgit closed this in 9794923 Dec 10, 2018
srowen pushed a commit that referenced this pull request Dec 13, 2018
## What changes were proposed in this pull request?
The following three condition descriptions should be updated, following #23228:
<li>no Ordering is specified,</li>
<li>no Aggregator is specified, and</li>
<li>the number of partitions is less than
<code>spark.shuffle.sort.bypassMergeThreshold</code>.
</li>
1. If the shuffle dependency specifies aggregation but aggregates only on the reduce side, BypassMergeSortShuffle can still be used.
2. If the number of output partitions is equal to spark.shuffle.sort.bypassMergeThreshold (e.g. 200), BypassMergeSortShuffle can still be used.

## How was this patch tested?
N/A

Closes #23281 from lcqzte10192193/wid-lcq-1211.

Authored-by: lichaoqun <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
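
The two corrections in the commit message above can be sketched in Scala as follows. This is an illustrative model, not Spark's implementation: `BypassMergeSortSketch` and `shouldBypassSketch` are hypothetical names, and the default threshold of 200 is taken from the "e.g. 200" example in the message. The sketch covers only the two points the message raises (map-side combine and the inclusive threshold).

```scala
object BypassMergeSortSketch {
  // Assumed default, taken from the "e.g. 200" example in the commit message.
  val DefaultBypassMergeThreshold: Int = 200

  // Hypothetical helper: bypass is possible when there is no map-side combine
  // and the partition count is at most the threshold (inclusive bound).
  def shouldBypassSketch(
      mapSideCombine: Boolean,
      numPartitions: Int,
      threshold: Int = DefaultBypassMergeThreshold): Boolean =
    !mapSideCombine && numPartitions <= threshold
}
```

With the assumed default, 200 partitions without map-side combine qualifies for the bypass path, and 201 does not.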
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?
`1. The shuffle dependency specifies no aggregation or output ordering.`
If the shuffle dependency specifies aggregation, but it only aggregates at the reduce-side, serialized shuffle can still be used.
`3. The shuffle produces fewer than 16777216 output partitions.`
If the number of output partitions is exactly 16777216, serialized shuffle can still be used.

See the method `canUseSerializedShuffle`.
## How was this patch tested?
N/A

Closes apache#23228 from 10110346/SerializedShuffle_doc.

Authored-by: liuxian <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019