[SPARK-23919][SQL] Add array_position function #21037
Conversation
Test build #89176 has finished for PR 21037 at commit
retest this please
since :)?
Oh, good catch
Test build #89188 has finished for PR 21037 at commit
Test build #89200 has finished for PR 21037 at commit
Test build #89218 has finished for PR 21037 at commit
Is array_position a string function? The function array_contains, for example, works on arrays. Also, this new function seems very similar to the existing instr ("Locate the position of the first occurrence of substr column in the given string"), except the return type is BigDecimal rather than Integer. For example, I ran a query using your branch:
scala> val df = Seq((Seq(1, 2, 3), "this is a test"), (Seq(7, 8, 9, 3), "yeah this")).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: array<int>, b: string]
scala> df.select(array_contains('a, 2), array_position('b, "this"), instr('b, "this")).show
+--------------------+-----------------------+--------------+
|array_contains(a, 2)|array_position(b, this)|instr(b, this)|
+--------------------+-----------------------+--------------+
|                true|                      1|             1|
|               false|                      6|             6|
+--------------------+-----------------------+--------------+
scala>
scala> df.select(array_position('a, 3)).show
<console>:35: error: type mismatch;
 found   : Int(3)
 required: String
       df.select(array_position('a, 3)).show
                                    ^
scala> df.select(array_contains('a, 3)).show
+--------------------+
|array_contains(a, 3)|
+--------------------+
|                true|
|                true|
+--------------------+
scala>
@bersprockets You are absolutely right. I made a mistake. I will completely reimplement this soon.
Test build #89238 has finished for PR 21037 at commit
Test build #89255 has finished for PR 21037 at commit
retest this please
Looks like this is a broken sentence?
Good catch, thanks
What is the behavior when left contains a null element?
Ah, if an array in left contains a null element, it works as a usual element, as you can see in this UT:
val left = Literal.create(Seq[String](null, ""), ArrayType(StringType)) // contains null
checkEvaluation(ArrayPosition(left, Literal("")), 2L)
checkEvaluation(ArrayPosition(left, Literal.create(null, StringType)), null)
So we can't know the position of null in the array even if the array contains null?
According to these UTs in Presto, you are right.
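The null handling discussed in this thread can be sketched in plain Python. This is a simplified model, not the Spark implementation: `None` stands in for SQL NULL, and the function name `array_position_with_nulls` is hypothetical. It mirrors the two unit-test expectations quoted above: a null element inside the array does not stop the search, while searching for null itself yields null.

```python
from typing import Any, Optional, Sequence


def array_position_with_nulls(arr: Optional[Sequence[Any]], value: Any) -> Optional[int]:
    """Model of the null behavior discussed above (None plays the role of NULL)."""
    # A null array or a null search value produces a null result.
    if arr is None or value is None:
        return None
    for index, element in enumerate(arr, start=1):
        # Null elements never match; the search simply moves on.
        if element is not None and element == value:
            return index
    return 0  # not found


left = [None, ""]  # contains a null element
print(array_position_with_nulls(left, ""))    # 2
print(array_position_with_nulls(left, None))  # None
```

Under this model the position of a null element cannot be queried, which is the limitation raised in the question above.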
Test build #89268 has finished for PR 21037 at commit
Test build #89314 has finished for PR 21037 at commit
retest this please
Test build #89322 has finished for PR 21037 at commit
stripMargin is missing?
good catch, sorry again
nit: $pos
I guess left.dataType.asInstanceOf[ArrayType].containsNull is not related to the nullability of this function?
ah, you are right.
So we can't know the position of null in the array even if the array contains null?
I guess we need to check null for each array element?
E.g. for the test Array Position in CollectionExpressionsSuite, if a0 is as follows:
val a0 = Literal.create(Seq(1, null, 2, 3), ArrayType(IntegerType))
checkEvaluation(ArrayPosition(a0, Literal(0)), 0L) will fail.
I totally agree with you. I added one test case for this.
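To see why the per-element null check matters, here is a small Python model. It assumes, as a simplification, that an integer array stores null slots as the default value 0 alongside a separate null mask; the class `IntArrayModel` and both search functions are hypothetical and are not Spark code. Without the check, searching for 0 wrongly matches the null slot.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntArrayModel:
    """Hypothetical primitive-backed array: null slots hold 0 plus a mask bit."""
    values: List[int]
    is_null: List[bool]


def position_unchecked(arr: IntArrayModel, value: int) -> int:
    # Compares raw slots only, so a null slot (stored as 0) can match 0.
    for i, v in enumerate(arr.values, start=1):
        if v == value:
            return i
    return 0


def position_checked(arr: IntArrayModel, value: int) -> int:
    # Skips slots flagged as null before comparing.
    for i, (v, null) in enumerate(zip(arr.values, arr.is_null), start=1):
        if not null and v == value:
            return i
    return 0


# Models Seq(1, null, 2, 3) from the comment above.
a0 = IntArrayModel(values=[1, 0, 2, 3], is_null=[False, True, False, False])
print(position_unchecked(a0, 0))  # 2 (false match on the null slot)
print(position_checked(a0, 0))    # 0 (not found)
```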
nit: remove an extra line.
Test build #89411 has finished for PR 21037 at commit
@ueshin could you please review again?
ueshin left a comment
LGTM except for some nits.
python/pyspark/sql/functions.py
Returns 0 if substr could not be found in str -> Returns 0 if the value could not be found in the array or something?
Returns 0 if substr could not be found in str -> Returns 0 if the value could not be found in the array or something?
We don't need this here because this is the same as the one in BinaryExpression.
nit: remove that?
The first character in str has index 1. -> The first element in the array has index 1. or something?
nit: unnecessary import?
Test build #89506 has finished for PR 21037 at commit
Thanks! merging to master.
HyukjinKwon left a comment
LGTM too.
      > SELECT _FUNC_(array(3, 2, 1), 1);
       3
  """,
  since = "2.4.0")
Just wanted to note that we can use note here too:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Lines 101 to 103 in 2ce37b5
  note = """
    Use RLIKE to match with standard regular expressions.
  """)
I am mentioning this because we are adding many functions now :-).
What changes were proposed in this pull request?
The PR adds the SQL function array_position. The behavior of the function is based on Presto's. The function returns the position of the first occurrence of the element in array x (or 0 if not found), using a 1-based index, as a BigInt.
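As a rough illustration of the semantics described here, the following Python sketch models the lookup on a plain list. It is not the Spark implementation; the name `array_position_sketch` is made up for this example, and null handling is left out.

```python
from typing import Any, Sequence


def array_position_sketch(arr: Sequence[Any], value: Any) -> int:
    """Return the 1-based position of the first occurrence of value, or 0."""
    for index, element in enumerate(arr, start=1):
        if element == value:
            return index
    return 0


print(array_position_sketch([3, 2, 1], 1))  # 3, matching the SQL example in the PR
print(array_position_sketch([3, 2, 3], 3))  # 1, the first occurrence wins
print(array_position_sketch([3, 2, 1], 7))  # 0, value not present
```

Returning 0 for "not found" works because valid positions start at 1, so callers never need a separate sentinel.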
How was this patch tested?
Added UTs