Skip to content

Conversation

@adrian-ionescu
Copy link
Contributor

@adrian-ionescu adrian-ionescu commented Nov 27, 2017

What changes were proposed in this pull request?

This PR introduces a way to explicitly range-partition a Dataset. So far, only round-robin and hash partitioning were possible via df.repartition(...), but sometimes range partitioning might be desirable: e.g. when writing to disk, for better compression without the cost of global sort.

The current implementation piggybacks on the existing RepartitionByExpression LogicalPlan and simply adds the following logic: If its expressions are of type SortOrder, then it will do RangePartitioning; otherwise HashPartitioning. This was by far the least intrusive solution I could come up with.

How was this patch tested?

Unit test for RepartitionByExpression changes, a test to ensure we're not changing the behavior of existing .repartition() and a few end-to-end tests in DataFrameSuite.

@hvanhovell
Copy link
Contributor

ok to test

*/
@scala.annotation.varargs
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] = withTypedPlan {
partitionExprs.find(_.expr.isInstanceOf[SortOrder]).foreach { sortOrder =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use collect or filter? That way we can show all offending columns.

val (sortOrder, nonSortOrder) = partitionExpressions.partition(_.isInstanceOf[SortOrder])

require(sortOrder.isEmpty || nonSortOrder.isEmpty,
s"""${getClass.getSimpleName} expects that either all its `partitionExpressions` are of type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want this to be a multiline message? it makes sense to put the sort order and non sort order on new lines.

// RepartitionByExpression's constructor verifies that either all expressions are
// of type SortOrder, in which case we're doing RangePartitioning, or none of them are,
// in which case we're doing HashPartitioning.
val partitioning = if (expressions.forall(_.isInstanceOf[SortOrder])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed this before, but to me it makes slightly more sense to add this logic to the RepartitionByExpression plan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and it also makes it easier to unit test this code :)


// .repartitionByRange() assumes .asc by default if no explicit sort order is specified
checkAnswer(
data2d.toDF("a", "b").repartitionByRange(data1d.size, $"a".desc, $"b")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data1d.size?

@hvanhovell
Copy link
Contributor

add to whitelist

@hvanhovell
Copy link
Contributor

@adrian-ionescu this looks pretty good. I left a few small comments.

}
}

test("repartitionByRange") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move it to DataFrameSuite?

* @since 2.3.0
*/
@scala.annotation.varargs
def repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T] = withTypedPlan {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open a JIRA for adding the corresponding API in PySpark?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Raised SPARK-22624.

col.expr match {
case expr: SortOrder =>
expr
case expr: Expression =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened if we have a SortOrder that is not in the root node of expr?

    data1d.toDF("val").repartitionByRange(data1d.size, $"val".desc + 1)
      .select(spark_partition_id().as("id"), $"val").show()

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning((val#236 DESC NULLS LAST + 1) ASC NULLS FIRST, 10)
+- LocalTableScan [val#236]

	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:116)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a more generic problem right? I think a similar error gets thrown if you do something like this: spark.range(10).select($"id".asc + 1).show()

Let's fix that in a different ticket.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot evaluate expression: input[0, bigint, false] ASC NULLS FIRST
java.lang.UnsupportedOperationException: Cannot evaluate expression: input[0, bigint, false] ASC NULLS FIRST
	at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:259)
	at org.apache.spark.sql.catalyst.expressions.SortOrder.doGenCode(SortOrder.scala:60)

Yeah. It also does not work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is slightly different because the project is whole stage code generated.

require(partitionExpressions.nonEmpty, "At least one partition-by expression must be specified.")

val partitioning: Partitioning = {
val (sortOrder, nonSortOrder) = partitionExpressions.partition(_.isInstanceOf[SortOrder])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still the same question. What happened when the SortOrder is not at the root node.

Copy link
Contributor Author

@adrian-ionescu adrian-ionescu Nov 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's going to follow the HashPartitioning path and eventually lead to a "Cannot evaluate expression" exception, just like it would presently do if you tried running df.repartition($"col".asc + 1) or df.sort($"col".asc + 1)

@SparkQA
Copy link

SparkQA commented Nov 27, 2017

Test build #84223 has finished for PR 19828 at commit 950b3dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 27, 2017

Test build #84222 has finished for PR 19828 at commit 950b3dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 27, 2017

Test build #84225 has finished for PR 19828 at commit 671e9e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM.

Like what @hvanhovell suggested, we can fix it in the follow-up PR. Thanks!

@SparkQA
Copy link

SparkQA commented Nov 28, 2017

Test build #84247 has finished for PR 19828 at commit fe690fc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-ionescu
Copy link
Contributor Author

[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.6 -Phive-thriftserver -Phive -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest hive-thriftserver/test mllib/test hive/test repl/test examples/test sql/test sql-kafka-0-10/test catalyst/test ; process was terminated by signal 9

Does not seem to be related.. can we retrigger the tests?

@adrian-ionescu
Copy link
Contributor Author

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Nov 28, 2017

Test build #84252 has finished for PR 19828 at commit fe690fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 28, 2017

Test build #84258 has finished for PR 19828 at commit 66b192d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"${getClass.getSimpleName} expects that either all its `partitionExpressions` are of type " +
"`SortOrder`, which means `RangePartitioning`, or none of them are `SortOrder`, which " +
"means `HashPartitioning`. In this case we have:" +
s""""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

" * 4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still exists after we revert the previous changes.


/**
* Returns a new Dataset partitioned by the given partitioning expressions into
* `numPartitions`. The resulting Dataset is range partitioned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update this to describe the latest change?

@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84296 has finished for PR 19828 at commit 012d617.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Returns a new Dataset that is hash partitioned by the given expressions into `numPartitions`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hash -> range

@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84308 has finished for PR 19828 at commit 60ec0e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 30, 2017

Test build #84338 has finished for PR 19828 at commit 67ca139.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"${getClass.getSimpleName} expects that either all its `partitionExpressions` are of type " +
"`SortOrder`, which means `RangePartitioning`, or none of them are `SortOrder`, which " +
"means `HashPartitioning`. In this case we have:" +
s""""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still exists after we revert the previous changes.


require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

require(partitionExpressions.nonEmpty, "At least one partition-by expression must be specified.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for safety, also keep this change?

Copy link
Contributor Author

@adrian-ionescu adrian-ionescu Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would change the current behavior of .repartition(numPartitions, Seq.empty: _*) and I'd like to avoid that.

In fact, I've just raised a separate ticket about the latter: SPARK-22665

@SparkQA
Copy link

SparkQA commented Nov 30, 2017

Test build #84347 has finished for PR 19828 at commit 012baa0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in f5f8e84 Nov 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants