@mgaido91
Contributor

What changes were proposed in this pull request?

This PR adds the slice function. Its behavior is based on Presto's implementation.

The function slices an array according to the requested start index and length.
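
The semantics described above can be sketched in plain Java. This is a minimal illustration, not the code in this patch: the class `SliceSketch` and its `slice` helper are hypothetical, and the rules it encodes (1-based start, negative start counted from the end, errors for start 0 and negative length, empty result for an out-of-range start) are taken from the checks quoted in the review comments below.

```java
import java.util.Arrays;

class SliceSketch {
    // Hypothetical helper, not Spark's implementation: slices `arr` using a
    // 1-based `start` (negative values count from the end) and a non-negative `length`.
    static Integer[] slice(Integer[] arr, int start, int length) {
        if (start == 0) {
            throw new RuntimeException("SQL array indices start at 1.");
        }
        if (length < 0) {
            throw new RuntimeException("length must be greater than or equal to 0.");
        }
        // Translate the SQL-style index into a 0-based offset.
        int startIndex = start > 0 ? start - 1 : arr.length + start;
        // Out-of-range start on either side: return an empty array without copying.
        if (startIndex < 0 || startIndex >= arr.length) {
            return new Integer[0];
        }
        // Clamp the end of the slice to the end of the input array.
        int end = (int) Math.min((long) startIndex + length, arr.length);
        return Arrays.copyOfRange(arr, startIndex, end);
    }

    public static void main(String[] args) {
        Integer[] a = {1, 2, null, 4};
        System.out.println(Arrays.toString(slice(a, 2, 3)));   // [2, null, 4]
        System.out.println(Arrays.toString(slice(a, -2, 5)));  // [null, 4]
        System.out.println(Arrays.toString(slice(a, 10, 1)));  // []
    }
}
```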

How was this patch tested?

Added unit tests.

|} else {
| $values = new Object[$resLength];
| for (int $i = 0; $i < $resLength; $i ++) {
| $values[$i] = ${CodeGenerator.getValue(x, elementType, s"$i + $startIdx")};
Member

Could this assignment degrade performance through boxing when the array element type is primitive (e.g. float)?

Contributor Author

I thought about that too, but I am not sure there is a better solution: this approach is used in both CreateArray and GenerateSafeProjection, and there is a TODO for specialized versions of GenericArrayData that can handle primitive types without boxing.

Probably we can try and fix this TODO in another PR/JIRA. What do you think?
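
To make the boxing concern concrete, here is a small standalone Java sketch. It does not use any Spark class; `BoxingSketch`, `copyBoxed`, and `copyPrimitive` are hypothetical names. It contrasts copying float elements into an Object[], as the generated loop above does, with copying them into a float[].

```java
import java.util.Arrays;

class BoxingSketch {
    // Boxed path: each element is autoboxed into a java.lang.Float object
    // before it is stored in the Object[].
    static Object[] copyBoxed(float[] src, int startIdx, int resLength) {
        Object[] values = new Object[resLength];
        for (int i = 0; i < resLength; i++) {
            values[i] = src[i + startIdx]; // implicit Float.valueOf(...)
        }
        return values;
    }

    // Primitive path: a bulk copy with no per-element object allocation.
    static float[] copyPrimitive(float[] src, int startIdx, int resLength) {
        float[] values = new float[resLength];
        System.arraycopy(src, startIdx, values, 0, resLength);
        return values;
    }

    public static void main(String[] args) {
        float[] src = {1f, 2f, 3f, 4f};
        System.out.println(copyBoxed(src, 1, 2)[0].getClass().getName()); // java.lang.Float
        System.out.println(Arrays.toString(copyPrimitive(src, 1, 2)));    // [2.0, 3.0]
    }
}
```

The boxed variant allocates one wrapper object per element, which is the overhead the reviewer is asking about; whether it matters in practice depends on array sizes and on how the result is consumed.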

Member

I see. If we postpone specialization, is it necessary to generate Java code for now? The generated code seems to do the same thing as nullSafeEval. WDYT?

Contributor Author

@mgaido91 mgaido91 Apr 12, 2018

I think it can be helpful: they are not really doing the same thing anyway. Moreover, this is also the way CreateArray and GenerateSafeProjection work, so for coherency I think this is the right thing to do. What do you think?

Member

For the future, I agree that this is the right way to generate Java code since we can avoid boxing.

On the other hand, you are proposing to postpone specialization. In both eval and the generated code, GenericArrayData is created from an Object[].
I may be misunderstanding the coherency argument, since I cannot find what it refers to in this thread.

Contributor Author

My target of coherency was the CreateArray operator and the code generated in GenerateSafeProjection.

Member

I might be missing something, but it seems that CreateArray uses different codegen paths for primitive arrays and for the others, and I guess GenerateSafeProjection uses Object[] on purpose to create GenericArrayData to be "safe" (avoid using UnsafeXxx).
I think we should modify this codegen to avoid boxing. WDYT?

Btw, don't we need a null check here anyway, since an array of a primitive type can contain nulls?

Contributor Author

You are right, I am not sure why I missed it; maybe I checked outdated code. Sorry, I am fixing it, thanks.
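
The null check raised above can be illustrated with a standalone Java sketch. `NullableIntArray` is a hypothetical stand-in for an array of primitive ints with per-element null flags; it is not a Spark class, and only its `isNullAt` and `getInt` accessors are modeled on the kind of interface the generated code reads from. The point is that the copy loop has to test for null before reading the primitive value, otherwise a null element would silently become 0.

```java
class NullCheckSketch {
    // Hypothetical stand-in for an array of primitive ints with per-element null flags.
    static final class NullableIntArray {
        final int[] values;
        final boolean[] nulls;

        NullableIntArray(int n) {
            values = new int[n];
            nulls = new boolean[n];
        }

        boolean isNullAt(int i) { return nulls[i]; }
        int getInt(int i) { return values[i]; }
    }

    // Copies resLength elements starting at startIdx, checking for null before
    // reading the primitive value so that nulls are preserved in the result.
    static NullableIntArray slice(NullableIntArray src, int startIdx, int resLength) {
        NullableIntArray out = new NullableIntArray(resLength);
        for (int i = 0; i < resLength; i++) {
            if (src.isNullAt(i + startIdx)) {
                out.nulls[i] = true;
            } else {
                out.values[i] = src.getInt(i + startIdx);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Models Seq(1, 2, null, 4); the slot behind the null holds a placeholder 0.
        NullableIntArray a = new NullableIntArray(4);
        a.values[0] = 1;
        a.values[1] = 2;
        a.values[3] = 4;
        a.nulls[2] = true;
        NullableIntArray r = slice(a, 1, 3);
        for (int i = 0; i < 3; i++) {
            System.out.println(r.isNullAt(i) ? "null" : String.valueOf(r.getInt(i)));
        }
    }
}
```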

@SparkQA

SparkQA commented Apr 11, 2018

Test build #89194 has finished for PR 21040 at commit 5cbbf7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Slice(x: Expression, start: Expression, length: Expression)

return new GenericArrayData(Array.empty[AnyRef])
}
val elementType = x.dataType.asInstanceOf[ArrayType].elementType
val data = arr.toArray[AnyRef](elementType)
Member

This PR #20984 can make slice better.

Contributor Author

Shall we wait for that PR to get in?

Member

I think that would be good, since we can avoid copying the whole array if that PR is merged in the near future.
@viirya What do you think?

Member

I think #20984 should be merged soon.

@gatorsmile
Member

cc @ueshin

*/
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(a1, a2) - Subsets array x starting from index start (or starting from the end if start is negative) with the specified length.",
Member

_FUNC_(x, start, length) instead of _FUNC_(a1, a2)?


override def nullable: Boolean = children.exists(_.nullable)

override def foldable: Boolean = children.forall(_.foldable)
Member

We don't need nullable and foldable here because these are the same as defined in TernaryExpression.

}
if (lengthInt < 0) {
throw new RuntimeException(s"Unexpected value for length in function $prettyName: " +
s"length must be greater than or equal to 0.")
Member

nit: the s prefix on the second string is unnecessary.

}
// this can happen if start is negative and its absolute value is greater than the
// number of elements in the array
if (startIndex < 0) {
Member

Should we also return early when startIndex >= arr.numElements(), to avoid the unnecessary arr.toArray conversion?

val arr = xVal.asInstanceOf[ArrayData]
val startIndex = if (startInt == 0) {
throw new RuntimeException(
s"Unexpected value for start in function $prettyName: SQL array indices start at 1.")
Member

nit: remove an extra space between $prettyName: and SQL.

checkEvaluation(Slice(a0, Literal.create(null, IntegerType), Literal(2)), null)
checkEvaluation(Slice(a0, Literal(2), Literal.create(null, IntegerType)), null)
checkEvaluation(Slice(Literal.create(null, ArrayType(IntegerType)), Literal(1), Literal(2)),
null)
Member

Can you add a case for something like Slice(a0, Literal(10), Literal(1))?

Contributor Author

added

Member

Also, can you add a case for a nullable primitive array, like Slice(Seq(1, 2, null, 4), 2, 3)?

}

protected def checkExceptionInExpression[T <: Throwable : ClassTag](
expression: Expression,
Member

expression: => Expression to be consistent with the overloaded one, just in case?

@SparkQA

SparkQA commented Apr 20, 2018

Test build #89640 has finished for PR 21040 at commit b94d067.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Apr 20, 2018

retest this please

@SparkQA

SparkQA commented Apr 20, 2018

Test build #89657 has finished for PR 21040 at commit b94d067.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

Any more comments, @kiszk @ueshin?

Member

@ueshin ueshin left a comment

I'm sorry for the delay.
I left some comments. Thanks!

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89923 has finished for PR 21040 at commit 72ed607.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayJoin(
  • case class Flatten(child: Expression) extends UnaryExpression
  • case class MonthsBetween(
  • trait QueryPlanConstraints extends ConstraintHelper
  • trait ConstraintHelper
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(
  • case class WriteToContinuousDataSource(
  • case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)

Member

@ueshin ueshin left a comment

LGTM except for a nit.

ev: ExprCode,
inputArray: String,
startIdx: String,
resLength: String): String = {
Member

nit: indent

@SparkQA

SparkQA commented Apr 30, 2018

Test build #89977 has finished for PR 21040 at commit 9d65570.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM too

@ueshin
Member

ueshin commented May 1, 2018

Jenkins, retest this please.

@kiszk
Member

kiszk commented May 1, 2018

retest this please

@ueshin
Member

ueshin commented May 1, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented May 1, 2018

Test build #89989 has finished for PR 21040 at commit 9d65570.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

mgaido91 commented May 1, 2018

retest this please

@SparkQA

SparkQA commented May 1, 2018

Test build #90005 has finished for PR 21040 at commit 9d65570.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

| UnsafeArrayData.calculateHeaderPortionInBytes($resLength) +
| ${classOf[ByteArrayMethods].getName}.roundNumberOfBytesToNearestWord(
| ${elementType.defaultSize} * $resLength);
|byte[] $bytesArray = new byte[$sizeInBytes];
Member

What happens if sizeInBytes is larger than Integer.MAX_VALUE? For example, 0x7000_0000 long elements. In this case, GenericArrayData or long[] can hold these elements. WDYT?

Contributor Author

In other places (e.g. Concat) we just throw a runtime exception in such a case. What about following the same pattern here?

Contributor Author

I am not even sure we have to add such a check actually, since here we can only reduce the size of an already existing array... Anyway probably it is ok to add an additional sanity check. WDYT?

Member

I am curious about the following two cases.

  1. In UnsafeArray, long[] may be used. Its size is 0x8000_0000 * 4. On the other hand, the size of the allocated byte[] is at most 0x8000_0000.
  2. If a GenericArray that holds many (e.g. 0x7F00_0000) Long or Double elements is passed to this operation, the expected allocation size is more than 0x8000_0000.

While these cases reduce the size of an existing array, does the result array fit into byte[]? WDYT?

Contributor Author

I added the same check that is performed in Concat and Flatten. If we also want to support larger arrays of primitives, it is probably best to have another PR that addresses the issue in all the affected functions (this one, Concat and Flatten), especially considering that the issue is much more likely to happen in the other two cases. Do you agree?
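
A sanity check of the kind discussed in this thread might look like the following standalone Java sketch. It is not the check added in this patch: `SizeCheckSketch` and its methods are hypothetical, the header formula (an 8-byte element count plus one null bit per element, padded to 8-byte words) and the word rounding are assumptions about the layout, and the limit of Integer.MAX_VALUE is simply the bound from the reviewer's question. The size is computed in long arithmetic so that the overflow is detected rather than wrapped.

```java
class SizeCheckSketch {
    // Assumed header layout: 8 bytes for the element count plus one null bit
    // per element, padded to a multiple of 8 bytes.
    static long headerBytes(long numElements) {
        return 8L + ((numElements + 63) / 64) * 8L;
    }

    // Rounds a byte count up to the next multiple of 8.
    static long roundToWord(long numBytes) {
        return (numBytes + 7) / 8 * 8;
    }

    // Computes the total size in long arithmetic and refuses sizes that
    // cannot be used as a byte[] length.
    static int checkedSizeInBytes(long numElements, int elementSize) {
        long size = headerBytes(numElements) + roundToWord(elementSize * numElements);
        if (size > Integer.MAX_VALUE) {
            throw new RuntimeException(
                "Unsuccessful try to slice array: " + size + " bytes exceed the byte[] limit.");
        }
        return (int) size;
    }

    public static void main(String[] args) {
        System.out.println(checkedSizeInBytes(3, 8)); // 40
        try {
            checkedSizeInBytes(0x70000000L, 8);
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```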

@SparkQA

SparkQA commented May 4, 2018

Test build #90202 has finished for PR 21040 at commit e2eb21e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2018

Test build #90196 has finished for PR 21040 at commit 9f0deec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ImplicitTypeCasts(conf: SQLConf) extends TypeCoercionRule
  • case class StringToTimestampWithoutTimezone(child: Expression, timeZoneId: Option[String] = None)

@SparkQA

SparkQA commented May 4, 2018

Test build #90203 has finished for PR 21040 at commit 07604e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented May 7, 2018

Thanks! merging to master.

@asfgit asfgit closed this in e35ad3c May 7, 2018