Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Apply review remarks.
  • Loading branch information
ala committed Feb 26, 2018
commit 051273651cd65b9eca568b37c79b50342a7f69c2
2 changes: 2 additions & 0 deletions core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala
Original file line number Diff line number Diff line change
Expand Up @@ -268,6 +268,8 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
}
// if we don't have enough partition groups, create duplicates
while (numCreated < targetLen) {
// Copy the preferred location from a random input partition.
// This helps in avoiding skew when the input partitions are clustered by preferred location.
val (nxt_replica, nxt_part) = partitionLocs.partsWithLocs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add comment to explain the purpose of this change here?

rnd.nextInt(partitionLocs.partsWithLocs.length))
val pgroup = new PartitionGroup(Some(nxt_replica))
Expand Down
5 changes: 2 additions & 3 deletions core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
Original file line number Diff line number Diff line change
Expand Up @@ -1152,9 +1152,8 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
.groupBy(identity)
.mapValues(_.size)

// Without the fix these would be:
// numPartsPerLocation(locations(0)) == numCoalescedPartitions - 1
// numPartsPerLocation(locations(1)) == 1
// Make sure the coalesced partitions are distributed fairly evenly between the two locations.
// This should not become flaky since the DefaultPartitionsCoalescer uses a fixed seed.
assert(numPartsPerLocation(locations(0)) > 0.4 * numCoalescedPartitions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How confident are we on the assert condition to be true? How is the fraction 0.4 chosen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, the result is deterministic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment about flakiness & fixed seed.

assert(numPartsPerLocation(locations(1)) > 0.4 * numCoalescedPartitions)
}
Expand Down