Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Apr 10, 2018

What changes were proposed in this pull request?

The PR adds the SQL function array_sort. The behavior of the function is based on Presto's one.

The function sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.

How was this patch tested?

Added UTs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add since?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to add since in this file? I cannot find since in collectionOperations.scala. Of course, functions.scala and functions.py have since.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every expression needs it .. many of functions don't have it because that field was added from 2.3.0. You could do it:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which will be shown in SQL documentation and the extended description command in sql syntax

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for letting know the example.

@HyukjinKwon
Copy link
Member

Seems JIRA number wrong ..

Copy link
Member

@viirya viirya Apr 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we let SortArray and ArraySort both implement this and don't have a default implementation here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@kiszk kiszk changed the title [SPARK-23916][SQL] Add array_sort function [SPARK-23921][SQL] Add array_sort function Apr 10, 2018
@kiszk
Copy link
Member Author

kiszk commented Apr 10, 2018

@HyukjinKwon thanks, fixed it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Seq.empty[Int]

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89088 has finished for PR 21021 at commit 00ac4fb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ArraySortUtil extends ExpectsInputTypes with CodegenFallback
  • case class SortArray(base: Expression, ascendingOrder: Expression)
  • case class ArraySort(child: Expression) extends UnaryExpression with ArraySortUtil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...not support sorting array of type ${dt.simpleString} which is not orderable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original message is compatible with SortArray. Is it better to update the message in SortArray with this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? Can't we just reuse the previous SortArray fixing the right to be TrueLiteral?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see the result in UT, null handing is different. As you suggested, to reuse existing code as possible, I refactored by using ArraySortUtil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks. Then can't we add another flag for it? I just find a bit weird that we have two functions which do basically the same thing. Just one has one more flag than the other and they behave differently with nulls. As a user I would not understand when to use one or the other. What do you think? I'd be also interested in @gatorsmile's point of view, since he created the JIRA for this.

Moreover, can we at least improve than the description of the SortArray case class in order to point out clearly how it behaves with nulls? Now there is no mention about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as you said they are doing similar things. Therefore, a new trait is not introduced to reuse as possible.
When one is subset of another one (e.g. size v.s. cardinality), we could take an approach that one calls another one. What I am doing in cardinality.

Good point about the description. I will add the description on how it works with null.

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89092 has finished for PR 21021 at commit 359fc7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89095 has finished for PR 21021 at commit 0a500d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

cc @ueshin

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89144 has finished for PR 21021 at commit 0203920.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 11, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 11, 2018

Test build #89170 has finished for PR 21021 at commit 0203920.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Apr 11, 2018

Test build #89179 has finished for PR 21021 at commit 0203920.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89525 has finished for PR 21021 at commit 26ef2bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 19, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89576 has finished for PR 21021 at commit 9977f64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89577 has finished for PR 21021 at commit 9977f64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

:param col: name of column or expression
>>> from pyspark.sql.functions import sort_array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

:param col: name of column or expression
>>> from pyspark.sql.functions import array_sort
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

}
// If -1, place null element at the end of the array
// If 1, place null element at the beginning of the array
protected def placeNullAtEnd: Int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the name is confusing.

How about nullOrder?
nullOrder = -1 means null is the least, place at the beginning for asc and at the end for desc, whereas 1 means the greatest, place at the end for asc and at the beginning for desc.
And maybe we should define the constants like:

object ArraySortUtil {
  type NullOrder = Int
  object NullOrder {
    val Least: NullOrder = -1
    val Greatest: NullOrder = 1
  }
}

>>> df = spark.createDataFrame([([2, 1, None, 3],),([1],),([],)], ['data'])
>>> df.select(array_sort(df.data).alias('r')).collect()
[Row(r=[1, 2, 3, None]), Row(r=[1]), Row(r=[])]
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: an extra space?

:param col: name of column or expression
>>> from pyspark.sql.functions import sort_array
>>> df = spark.createDataFrame([([2, 1, 3],),([1],),([],)], ['data'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include None to show where the None comes?

@SparkQA
Copy link

SparkQA commented Apr 29, 2018

Test build #89969 has finished for PR 21021 at commit d1b0483.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 29, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 30, 2018

Test build #89975 has finished for PR 21021 at commit d1b0483.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"org.apache.spark.sql.types.DataTypes.createDecimalType($p, $s)"
case ArrayType(et, cn) =>
s"org.apache.spark.sql.types.DataTypes.createArrayType($et, $cn)"
case MapType(kt, vt, cn) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need for MapType because MapType is not orderable?

s"apply(new java.util.ArrayList(${f.length}))"
case _ =>
s"org.apache.spark.sql.types.DataTypes.$elementType"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this will work for all complex types, e.g. ArrayType(ArrayType(IntegerType))?
How about using reference object of elementType?

s"org.apache.spark.sql.types.DataTypes.$elementType"
}
s"""
|Object[] $array = (Object[]) (($arrayData) $base).toArray(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using toObjectArray which doesn't need ClassTag?

val genericArrayData = classOf[GenericArrayData].getName
val array = ctx.freshName("array")
val c = ctx.freshName("c")
val sort = if (elementType == NullType) "" else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just copying the original array if elementType == NullType?

s"apply(new java.util.ArrayList(${f.length}))"
case _ =>
s"org.apache.spark.sql.types.$elementType$$.MODULE$$"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering whether this will work or not. What if elementType is ArrayType(ArrayType(IntegerType))?
Can't we use a reference object of elementType?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add some tests using ArrayType and StructType for elementType?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, I added some complex test cases with nests.

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90021 has finished for PR 21021 at commit 9f63a76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented May 2, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90031 has finished for PR 21021 at commit 9f63a76.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

1 similar comment
@kiszk
Copy link
Member Author

kiszk commented May 2, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 3, 2018

Test build #90082 has finished for PR 21021 at commit 9f63a76.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented May 3, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 3, 2018

Test build #90118 has finished for PR 21021 at commit 9f63a76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 4, 2018

Test build #90163 has finished for PR 21021 at commit e3fcaaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for some nits.

val array = ctx.freshName("array")
val c = ctx.freshName("c")
if (elementType == NullType) {
s"${ev.value} = (($arrayData) $base).copy();"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need cast base to ArrayData?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, done

s"int $c = ${ctx.genComp(elementType, s"(($jt) $o1)", s"(($jt) $o2)")};"
}
s"""
|Object[] $array = (Object[]) (($arrayData) $base).toObjectArray($dataTypes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

if (elementType == NullType) {
s"${ev.value} = (($arrayData) $base).copy();"
} else {
val dataTypes = ctx.addReferenceObj("dataType", elementType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about elementTypeTerm or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

extends BinaryExpression with ExpectsInputTypes with CodegenFallback {

def this(e: Expression) = this(e, Literal(true))
trait ArraySortUtil extends ExpectsInputTypes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about ArraySortLike?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thank you for your review while it is a long-holiday week in Japan.

val v1 = ctx.freshName("v1")
val v2 = ctx.freshName("v2")
s"""
|$jt $v1 = (($bt) $o1).${jt}Value();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to enforce the boxing? An why do we need to cast to the java type in the non primitive scenario?

Copy link
Member Author

@kiszk kiszk May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this is because compare() in java.util.Arrays.sort accepts two Object arguments. Thus, we do boxing here.
Now, I realized java.util.Arrays.sort has sort() method only for ascending. Let me use them for ascending and non-null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I see, thanks.

s"int $c = ${ctx.genComp(elementType, s"(($jt) $o1)", s"(($jt) $o2)")};"
}
s"""
|Object[] $array = $base.toObjectArray($elementTypeTerm);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid boxing? Probably it is quite hard for null handling, so we can ignore this. Did you try?

Copy link
Member Author

@kiszk kiszk May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since java.util.Arrays.sort accepts Object[], this PR always uses Object[]. For now, I support the cases for ascending and non-null primitive types by using java.util.Array.sort.
Like this thread, another PR can avoid boxing by using type-specialized sorting implementation in generic cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry that I cannot understand this phase Probably it is quite hard for null handling. Would it be possible to elaborate on your thought?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I added the handling of primitives in the PR you mentioned, so it handles that now.

My sentence meant that probably the main issue is to avoid problems with nulls for primitive types, since in that case the returned array doesn't contain null but the default value for null items IIUC. So probably it is some extra effort (we should also check if it is worth). I am fine having this done in a followup PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, main issue is to find or prepare sort implementations to handle required specification (ascending, descending, or ordering of null, with various types, in specialized implementation).
If we can use existing implementation for now, we can handle them in this PR. Otherwise, we would like to handle such a case in another PR since it requires to implement sort algorithms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably the performance gain would not be worth this effort. I agree going on with the current implementation.

@SparkQA
Copy link

SparkQA commented May 4, 2018

Test build #90189 has finished for PR 21021 at commit 2ad6bb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ArraySortLike extends ExpectsInputTypes
  • case class ArraySort(child: Expression) extends UnaryExpression with ArraySortLike

@SparkQA
Copy link

SparkQA commented May 4, 2018

Test build #90190 has finished for PR 21021 at commit 2c4404c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 4, 2018

Test build #90198 has finished for PR 21021 at commit 21521d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented May 7, 2018

Thanks! merging to master.

@asfgit asfgit closed this in 7564a9a May 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants