-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-23921][SQL] Add array_sort function #21021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add since?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to add since in this file? I cannot find since in collectionOperations.scala. Of course, functions.scala and functions.py have since.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every expression needs it .. many of functions don't have it because that field was added from 2.3.0. You could do it:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
Line 511 in 2ca9bb0
| since = "2.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which will be shown in SQL documentation and the extended description command in sql syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for letting know the example.
|
Seems JIRA number wrong .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we let SortArray and ArraySort both implement this and don't have a default implementation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
|
@HyukjinKwon thanks, fixed it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Seq.empty[Int]
|
Test build #89088 has finished for PR 21021 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...not support sorting array of type ${dt.simpleString} which is not orderable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original message is compatible with SortArray. Is it better to update the message in SortArray with this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this? Can't we just reuse the previous SortArray fixing the right to be TrueLiteral?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see the result in UT, null handing is different. As you suggested, to reuse existing code as possible, I refactored by using ArraySortUtil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks. Then can't we add another flag for it? I just find a bit weird that we have two functions which do basically the same thing. Just one has one more flag than the other and they behave differently with nulls. As a user I would not understand when to use one or the other. What do you think? I'd be also interested in @gatorsmile's point of view, since he created the JIRA for this.
Moreover, can we at least improve than the description of the SortArray case class in order to point out clearly how it behaves with nulls? Now there is no mention about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as you said they are doing similar things. Therefore, a new trait is not introduced to reuse as possible.
When one is subset of another one (e.g. size v.s. cardinality), we could take an approach that one calls another one. What I am doing in cardinality.
Good point about the description. I will add the description on how it works with null.
|
Test build #89092 has finished for PR 21021 at commit
|
|
Test build #89095 has finished for PR 21021 at commit
|
|
cc @ueshin |
|
Test build #89144 has finished for PR 21021 at commit
|
|
retest this please |
|
Test build #89170 has finished for PR 21021 at commit
|
|
retest this please |
|
Test build #89179 has finished for PR 21021 at commit
|
|
Test build #89525 has finished for PR 21021 at commit
|
|
retest this please |
|
Test build #89576 has finished for PR 21021 at commit
|
|
Test build #89577 has finished for PR 21021 at commit
|
python/pyspark/sql/functions.py
Outdated
| :param col: name of column or expression | ||
| >>> from pyspark.sql.functions import sort_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this?
python/pyspark/sql/functions.py
Outdated
| :param col: name of column or expression | ||
| >>> from pyspark.sql.functions import array_sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
| } | ||
| // If -1, place null element at the end of the array | ||
| // If 1, place null element at the beginning of the array | ||
| protected def placeNullAtEnd: Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the name is confusing.
How about nullOrder?
nullOrder = -1 means null is the least, place at the beginning for asc and at the end for desc, whereas 1 means the greatest, place at the end for asc and at the beginning for desc.
And maybe we should define the constants like:
object ArraySortUtil {
type NullOrder = Int
object NullOrder {
val Least: NullOrder = -1
val Greatest: NullOrder = 1
}
}
python/pyspark/sql/functions.py
Outdated
| >>> df = spark.createDataFrame([([2, 1, None, 3],),([1],),([],)], ['data']) | ||
| >>> df.select(array_sort(df.data).alias('r')).collect() | ||
| [Row(r=[1, 2, 3, None]), Row(r=[1]), Row(r=[])] | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: an extra space?
python/pyspark/sql/functions.py
Outdated
| :param col: name of column or expression | ||
| >>> from pyspark.sql.functions import sort_array | ||
| >>> df = spark.createDataFrame([([2, 1, 3],),([1],),([],)], ['data']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should include None to show where the None comes?
|
Test build #89969 has finished for PR 21021 at commit
|
|
retest this please |
|
Test build #89975 has finished for PR 21021 at commit
|
| s"org.apache.spark.sql.types.DataTypes.createDecimalType($p, $s)" | ||
| case ArrayType(et, cn) => | ||
| s"org.apache.spark.sql.types.DataTypes.createArrayType($et, $cn)" | ||
| case MapType(kt, vt, cn) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need for MapType because MapType is not orderable?
| s"apply(new java.util.ArrayList(${f.length}))" | ||
| case _ => | ||
| s"org.apache.spark.sql.types.DataTypes.$elementType" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this will work for all complex types, e.g. ArrayType(ArrayType(IntegerType))?
How about using reference object of elementType?
| s"org.apache.spark.sql.types.DataTypes.$elementType" | ||
| } | ||
| s""" | ||
| |Object[] $array = (Object[]) (($arrayData) $base).toArray( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using toObjectArray which doesn't need ClassTag?
| val genericArrayData = classOf[GenericArrayData].getName | ||
| val array = ctx.freshName("array") | ||
| val c = ctx.freshName("c") | ||
| val sort = if (elementType == NullType) "" else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just copying the original array if elementType == NullType?
| s"apply(new java.util.ArrayList(${f.length}))" | ||
| case _ => | ||
| s"org.apache.spark.sql.types.$elementType$$.MODULE$$" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still wondering whether this will work or not. What if elementType is ArrayType(ArrayType(IntegerType))?
Can't we use a reference object of elementType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add some tests using ArrayType and StructType for elementType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, I added some complex test cases with nests.
|
Test build #90021 has finished for PR 21021 at commit
|
|
retest this please |
|
Test build #90031 has finished for PR 21021 at commit
|
|
retest this please |
1 similar comment
|
retest this please |
|
Test build #90082 has finished for PR 21021 at commit
|
|
retest this please |
|
Test build #90118 has finished for PR 21021 at commit
|
|
Test build #90163 has finished for PR 21021 at commit
|
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for some nits.
| val array = ctx.freshName("array") | ||
| val c = ctx.freshName("c") | ||
| if (elementType == NullType) { | ||
| s"${ev.value} = (($arrayData) $base).copy();" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we need cast base to ArrayData?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, done
| s"int $c = ${ctx.genComp(elementType, s"(($jt) $o1)", s"(($jt) $o2)")};" | ||
| } | ||
| s""" | ||
| |Object[] $array = (Object[]) (($arrayData) $base).toObjectArray($dataTypes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
| if (elementType == NullType) { | ||
| s"${ev.value} = (($arrayData) $base).copy();" | ||
| } else { | ||
| val dataTypes = ctx.addReferenceObj("dataType", elementType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about elementTypeTerm or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see
| extends BinaryExpression with ExpectsInputTypes with CodegenFallback { | ||
|
|
||
| def this(e: Expression) = this(e, Literal(true)) | ||
| trait ArraySortUtil extends ExpectsInputTypes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about ArraySortLike?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thank you for your review while it is a long-holiday week in Japan.
| val v1 = ctx.freshName("v1") | ||
| val v2 = ctx.freshName("v2") | ||
| s""" | ||
| |$jt $v1 = (($bt) $o1).${jt}Value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to enforce the boxing? An why do we need to cast to the java type in the non primitive scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, this is because compare() in java.util.Arrays.sort accepts two Object arguments. Thus, we do boxing here.
Now, I realized java.util.Arrays.sort has sort() method only for ascending. Let me use them for ascending and non-null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see, thanks.
| s"int $c = ${ctx.genComp(elementType, s"(($jt) $o1)", s"(($jt) $o2)")};" | ||
| } | ||
| s""" | ||
| |Object[] $array = $base.toObjectArray($elementTypeTerm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid boxing? Probably it is quite hard for null handling, so we can ignore this. Did you try?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since java.util.Arrays.sort accepts Object[], this PR always uses Object[]. For now, I support the cases for ascending and non-null primitive types by using java.util.Array.sort.
Like this thread, another PR can avoid boxing by using type-specialized sorting implementation in generic cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry that I cannot understand this phase Probably it is quite hard for null handling. Would it be possible to elaborate on your thought?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I added the handling of primitives in the PR you mentioned, so it handles that now.
My sentence meant that probably the main issue is to avoid problems with nulls for primitive types, since in that case the returned array doesn't contain null but the default value for null items IIUC. So probably it is some extra effort (we should also check if it is worth). I am fine having this done in a followup PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, main issue is to find or prepare sort implementations to handle required specification (ascending, descending, or ordering of null, with various types, in specialized implementation).
If we can use existing implementation for now, we can handle them in this PR. Otherwise, we would like to handle such a case in another PR since it requires to implement sort algorithms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, probably the performance gain would not be worth this effort. I agree going on with the current implementation.
|
Test build #90189 has finished for PR 21021 at commit
|
|
Test build #90190 has finished for PR 21021 at commit
|
|
Test build #90198 has finished for PR 21021 at commit
|
|
Thanks! merging to master. |
What changes were proposed in this pull request?
The PR adds the SQL function
array_sort. The behavior of the function is based on Presto's one.The function sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
How was this patch tested?
Added UTs