-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-23930][SQL] Add slice function #21040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| |} else { | ||
| | $values = new Object[$resLength]; | ||
| | for (int $i = 0; $i < $resLength; $i ++) { | ||
| | $values[$i] = ${CodeGenerator.getValue(x, elementType, s"$i + $startIdx")}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May this assignment cause performance degradation due to boxing if array element type is primitive (e.g. float)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I though about that too, but I am not sure there is a better solution: this approach is used both in CreateArray and GenerateSafeProjection. And there is a TODO for specialized versions of GenericArrayData able to deal with primitive types without boxing.
Probably we can try and fix this TODO in another PR/JIRA. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. If we postpone specialization, is it necessary to generate Java code for now? The generated code seems to do the same thing in nullSafeEval. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be helpful: they are not really doing the same thing anyway. Moreover, this is the way also CreateArray and GenerateSafeProjection work, so for coherency I think this is the right thing to do. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the future, I agree that this is the right way to generate Java code since we can avoid boxing.
On the other hand, you are proposing to postpone specialization. In eval and generated code, GenericArrayData is generated by using Object[].
I may misunderstand for coherency since I may not find the target of the coherency in the thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My target of coherency was the CreateArray operator and the code generated in GenerateSafeProjection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might miss something, but seems like CreateArray is using different ways to codegen for primitive arrays and the others, and I guess GenerateSafeProjection is using Object[] on purpose to create GenericArrayData to be "safe" (avoid using UnsafeXxx).
I think we should modify this codegen to avoid boxing. WDYT?
Btw, we need to null check here for an array of primitive type contains null anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I am not sure why I missed it...maybe I checked outdated code. Sorry, I am fixing it, thanks.
|
Test build #89194 has finished for PR 21040 at commit
|
| return new GenericArrayData(Array.empty[AnyRef]) | ||
| } | ||
| val elementType = x.dataType.asInstanceOf[ArrayType].elementType | ||
| val data = arr.toArray[AnyRef](elementType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR #20984 can make slice better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we wait for that PR to get in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good since we can avoid the whole array copy if that PR will be merged near future.
@viirya What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think #20984 should be merged soon.
|
cc @ueshin |
| */ | ||
| // scalastyle:off line.size.limit | ||
| @ExpressionDescription( | ||
| usage = "_FUNC_(a1, a2) - Subsets array x starting from index start (or starting from the end if start is negative) with the specified length.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FUNC_(x, start, length) instead of _FUNC_(a1, a2)?
|
|
||
| override def nullable: Boolean = children.exists(_.nullable) | ||
|
|
||
| override def foldable: Boolean = children.forall(_.foldable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need nullable and foldable here because these are the same as defined in TernaryExpression.
| } | ||
| if (lengthInt < 0) { | ||
| throw new RuntimeException(s"Unexpected value for length in function $prettyName: " + | ||
| s"length must be greater than or equal to 0.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary s.
| } | ||
| // this can happen if start is negative and its absolute value is greater than the | ||
| // number of elements in the array | ||
| if (startIndex < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also skip when startIndex >= arr.numElements() to avoid unnecessary convert arr.toArray?
| val arr = xVal.asInstanceOf[ArrayData] | ||
| val startIndex = if (startInt == 0) { | ||
| throw new RuntimeException( | ||
| s"Unexpected value for start in function $prettyName: SQL array indices start at 1.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove an extra space between $prettyName: and SQL.
| checkEvaluation(Slice(a0, Literal.create(null, IntegerType), Literal(2)), null) | ||
| checkEvaluation(Slice(a0, Literal(2), Literal.create(null, IntegerType)), null) | ||
| checkEvaluation(Slice(Literal.create(null, ArrayType(IntegerType)), Literal(1), Literal(2)), | ||
| null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a case for something like Slice(a0, Literal(10), Literal(1))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also can you add a case for nullable primitive array like Slice(Seq(1, 2, null, 4), 2, 3)?
| } | ||
|
|
||
| protected def checkExceptionInExpression[T <: Throwable : ClassTag]( | ||
| expression: Expression, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expression: => Expression to be consistent with the overloaded one, just in case?
|
Test build #89640 has finished for PR 21040 at commit
|
|
retest this please |
|
Test build #89657 has finished for PR 21040 at commit
|
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for the delay.
I left some comments. Thanks!
| |} else { | ||
| | $values = new Object[$resLength]; | ||
| | for (int $i = 0; $i < $resLength; $i ++) { | ||
| | $values[$i] = ${CodeGenerator.getValue(x, elementType, s"$i + $startIdx")}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might miss something, but seems like CreateArray is using different ways to codegen for primitive arrays and the others, and I guess GenerateSafeProjection is using Object[] on purpose to create GenericArrayData to be "safe" (avoid using UnsafeXxx).
I think we should modify this codegen to avoid boxing. WDYT?
Btw, we need to null check here for an array of primitive type contains null anyway?
| checkEvaluation(Slice(a0, Literal.create(null, IntegerType), Literal(2)), null) | ||
| checkEvaluation(Slice(a0, Literal(2), Literal.create(null, IntegerType)), null) | ||
| checkEvaluation(Slice(Literal.create(null, ArrayType(IntegerType)), Literal(1), Literal(2)), | ||
| null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also can you add a case for nullable primitive array like Slice(Seq(1, 2, null, 4), 2, 3)?
|
Test build #89923 has finished for PR 21040 at commit
|
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for a nit.
| ev: ExprCode, | ||
| inputArray: String, | ||
| startIdx: String, | ||
| resLength: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
|
Test build #89977 has finished for PR 21040 at commit
|
|
retest this please |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too
|
Jenkins, retest this please. |
|
retest this please |
|
Jenkins, retest this please. |
|
Test build #89989 has finished for PR 21040 at commit
|
|
retest this please |
|
Test build #90005 has finished for PR 21040 at commit
|
| | UnsafeArrayData.calculateHeaderPortionInBytes($resLength) + | ||
| | ${classOf[ByteArrayMethods].getName}.roundNumberOfBytesToNearestWord( | ||
| | ${elementType.defaultSize} * $resLength); | ||
| |byte[] $bytesArray = new byte[$sizeInBytes]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if sizeInBytes is larger than Integer.MAX_VALUE? For example, 0x7000_0000 long elements. In this case, GenericArrayData or long[] can hold these elements. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other places (eg Concat) in such a case we just throw a runtime exception. What about following the same pattern here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not even sure we have to add such a check actually, since here we can only reduce the size of an already existing array... Anyway probably it is ok to add an additional sanity check. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about the following two cases.
- In
UnsafeArray,long[]may be used. Its size is0x8000_0000 * 4. On the other hand, the size is the allocatedbyte[]is up to0x8000_0000. If GenericArray, which includes a lot of (e.g.0x7F00_0000)LongorDoubleelements, is passed to this operation, the expected allocation size is more than0x8000_0000.
While these cases reduce the size of an existing array, does the result array fit into byte[]? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the same check which is performed in Concat and Flatten. If we want to support also larger arrays of primitives, we probably best have another PR which address the issue on all the functions affected (this one, Concat and Flatten), especially considering that the issue is much more likely to happen in the other two cases. Do you agree?
|
Test build #90202 has finished for PR 21040 at commit
|
|
Test build #90196 has finished for PR 21040 at commit
|
|
Test build #90203 has finished for PR 21040 at commit
|
|
Thanks! merging to master. |
What changes were proposed in this pull request?
The PR add the
slicefunction. The behavior of the function is based on Presto's one.The function slices an array according to the requested start index and length.
How was this patch tested?
added UTs