[SPARK-40407][SQL] Fix the potential data skew caused by df.repartition #37855
Changes from all commits: cf7bdba, 73620b4, 2c3c4b4, 20757ae, c4423b6, 8968d4f, 0f1c677, 57773d4
```diff
@@ -21,6 +21,7 @@ import java.util.Random
 import java.util.function.Supplier
 
 import scala.concurrent.Future
+import scala.util.hashing
 
 import org.apache.spark._
 import org.apache.spark.internal.config
```

```diff
@@ -299,7 +300,14 @@ object ShuffleExchangeExec {
     def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match {
       case RoundRobinPartitioning(numPartitions) =>
         // Distributes elements evenly across output partitions, starting from a random partition.
-        var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions)
```
|
Contributor: Sorry, I may be missing something. Doesn't the original code already produce different starting positions for different mapper tasks?

Contributor: OK, I tried it. Can we add some comments to explain why we add the byteswap32 here?

Contributor: This was fixed in SPARK-21782 for RDD; it looks like the SQL version did not leverage it.

Contributor (Author): @mridulm Wow, good finding. I didn't realize there was a similar issue.
```diff
+        // nextInt(numPartitions) implementation has a special case when bound is a power of 2,
+        // which is basically taking several highest bits from the initial seed, with only a
+        // minimal scrambling. Due to deterministic seed, using the generator only once,
+        // and lack of scrambling, the position values for power-of-two numPartitions always
+        // end up being almost the same regardless of the index. substantially scrambling the
+        // seed by hashing will help. Refer to SPARK-21782 for more details.
+        val partitionId = TaskContext.get().partitionId()
+        var position = new Random(hashing.byteswap32(partitionId)).nextInt(numPartitions)
         (row: InternalRow) => {
           // The HashPartitioner will handle the `mod` by the number of partitions
           position += 1
```
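The rationale in the added comment can be checked outside Spark. The standalone sketch below is not part of the PR; the object name and the choice of 8 partitions are illustrative. It compares starting positions produced by seeding `java.util.Random` with the raw partition id versus a `byteswap32`-scrambled seed for a power-of-two `numPartitions`:

```scala
import java.util.Random

import scala.util.hashing

object StartingPositionDemo {
  def main(args: Array[String]): Unit = {
    val numPartitions = 8 // a power of 2 triggers the special case in Random.nextInt
    val partitionIds = 0 until 8

    // Seed directly with the small, consecutive partition ids (pre-fix behaviour).
    val rawSeeded = partitionIds.map(id => new Random(id).nextInt(numPartitions))
    // Scramble the seed first with byteswap32 (what the patched code does).
    val scrambled = partitionIds.map(id => new Random(hashing.byteswap32(id)).nextInt(numPartitions))

    println(s"raw partition-id seeds: ${rawSeeded.mkString(",")}")
    println(s"byteswap32 seeds:       ${scrambled.mkString(",")}")
  }
}
```

With the raw seeds the positions come out nearly identical across partition ids, matching the behaviour the comment above describes; with byteswap32 they spread across the partition range.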
HyukjinKwon: I think removing this can actually cause a regression such as skewed data, since the starting partition is always the same?
Author: @HyukjinKwon Thanks for reviewing.
The original comment, "starting from a random partition", means (please correct me if I am wrong) the reducer partition at which a map task begins its round-robin shuffle write. With large data, rows are still distributed evenly from each map partition. The issue here is that each shuffle (map) partition holds very little data, fewer rows than the total number of reducer partitions. If the starting position is the same for every map partition, then the rows of every map partition land in the same few reducer partitions, and some reducer partitions receive no data at all.
This PR simply makes the partitionId the default starting position for the round-robin, so each map partition starts from a different position.
I tested the code below.
With my PR it outputs 24,25,26,25; without my PR it outputs 50,0,0,50.
Similarly, if I change it to repartition(8), with my PR it outputs 12,13,14,13,12,12,12,12; without my PR it outputs 0,0,0,0,0,0,50,50.
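The test snippet referenced above was not captured in this thread. The following is only a hedged reconstruction of that kind of check; the 100-row dataset, the 50 input partitions, and the local master setting are assumptions inferred from the reported counts, not the author's actual code:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionSkewCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("repartition-skew").getOrCreate()
    import spark.implicits._

    // 100 rows spread over many small input partitions, so each map task only has a
    // couple of rows to hand out during the round-robin shuffle write.
    val df = spark.range(0, 100, 1, numPartitions = 50).toDF("id")

    // Count how many rows each reducer partition ends up with after repartition(4).
    val sizes = df.repartition(4)
      .mapPartitions(iter => Iterator(iter.size))
      .collect()

    println(sizes.mkString(",")) // heavily skewed before the fix, roughly even after it
    spark.stop()
  }
}
```

With many tiny map partitions all starting their round-robin at the same reducer index, only a couple of reducer partitions receive data; with per-partition starting positions the counts come out roughly even.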
Author: I changed the Random to XORShiftRandom.
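A hedged sketch of what that direction might look like is below; the package, object name, and partition counts are illustrative, and it is not the final merged code. Because org.apache.spark.util.random.XORShiftRandom is private[spark], the sketch must be compiled against Spark from a package under org.apache.spark:

```scala
package org.apache.spark.demo

import org.apache.spark.util.random.XORShiftRandom

object XorShiftStartDemo {
  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    // XORShiftRandom hashes its seed (MurmurHash3) before use, so even small, consecutive
    // partition ids yield well-spread starting positions without an explicit byteswap32 call.
    val positions = (0 until 8).map(id => new XORShiftRandom(id).nextInt(numPartitions))
    println(positions.mkString(","))
  }
}
```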