[SPARK-23923][SQL] Add cardinality function #21031
Conversation
Test build #89141 has finished for PR 21031 at commit
Test build #89150 has finished for PR 21031 at commit
retest this please
Test build #89169 has finished for PR 21031 at commit
retest this please
can't we extend Size instead?
Is it better to extend a case class?
oh, I see... maybe a common abstract class as you did in another case?
missing scala doc
good catch, thanks
Test build #89189 has finished for PR 21031 at commit
minor: what about adding a def resultTypeBigInt (or something similar) which is overridden by the subclasses, instead of having it as an argument?
Would it be possible to elaborate on the advantage of adding def resultTypeBigInt instead of passing it as an argument?
In this way you can write the eval and doGenCode methods directly, and in the Size and Cardinality classes we just need to override def resultTypeBigInt, setting it to true or false. I think it is cleaner, but it is not a big deal.
Now, I realized dataType can be used for this purpose. Thank you for your comment.
even better!
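The refactoring the thread converges on above — a common abstract base class whose declared dataType drives the shared evaluation logic, instead of a boolean constructor argument — can be sketched roughly as follows. This is a hypothetical, plain-Python illustration of the design, not Spark's actual Expression API; all names here are illustrative.

```python
# Hypothetical sketch (plain Python, not Spark's Expression API) of the
# design discussed above: a common abstract base class implements the
# shared length logic once, and each subclass only declares its data type.

class CollectionLength:
    """Shared base: length of an array (list) or map (dict)."""
    data_type = "int"  # overridden by subclasses; the only per-class difference

    def eval(self, value):
        # One shared implementation; no per-subclass eval/doGenCode needed.
        return len(value)

class Size(CollectionLength):
    data_type = "int"      # size returns int

class Cardinality(CollectionLength):
    data_type = "bigint"   # Presto's cardinality returns BigInt (long in Spark)
```

With dataType as the only override point, the evaluation and code-generation logic can live entirely in the shared base, which is what the discussion above settles on.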
Test build #89206 has finished for PR 21031 at commit
BigInt is Presto's data type name; I think we should use Spark SQL's data type here. Btw, I think we should use LongType instead of DecimalType.
Good catch, thanks.
I will update #21037, too
It is quite pointless to have this and def doGenCode both here and in Size. Can't we just implement eval and doGenCode in SizeUtil? i.e. rename doSizeGenCode -> doGenCode and sizeEval -> eval there.
Good catch, thanks
Test build #89254 has finished for PR 21031 at commit
Test build #89258 has finished for PR 21031 at commit
retest this please
Test build #89269 has finished for PR 21031 at commit
retest this please
Test build #89280 has finished for PR 21031 at commit
BigInt -> long
Good catch, thanks
LGTM
Test build #89332 has finished for PR 21031 at commit
If there is already size, why do we need to create a new implementation? Why can't we just rewrite cardinality to size? Also I wouldn't add any programming API for this, since there is already size.
According to my understanding, these activities are to improve compatibility with other DBs (like Presto) in https://issues.apache.org/jira/browse/SPARK-23899 and https://issues.apache.org/jira/browse/SPARK-23923. As you pointed out, @gatorsmile what do you think?
Do not add the function APIs. Adding the SQL function is enough.
I see.
Does expression[Size]("cardinality") work?
In Presto, cardinality's return type is BigInt. Thus, cardinality in Spark uses long as the return type. If we use int as cardinality's return type, I think that it works.
ping @gatorsmile
That might be fine in most cases. If we really need it in the future, we can add new APIs with better names later.
Like in SQL, we have COUNT and COUNT_BIG. COUNT_BIG is similar to COUNT, but the result can be greater than the max value of integer.
I see. I will update this PR with only expression[Cardinality]("cardinality"). I will also update the description and JIRA later.
Is it OK with you? @gatorsmile cc: @ueshin
One question: do we need @ExpressionDescription? If we need this, we may need to create a Cardinality case class that is the same as Size. Of course, Cardinality and Size can extend the same parent class.
Alias names are implemented like that; no need to create a new one. We did the same for other SQL functions, like char, char_length, and many more. We can update the description in the existing ExpressionDescription.
Thank you for your clarification. I see.
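The alias approach described above — registering a second SQL name onto the existing implementation rather than adding a new expression class — can be illustrated with a toy registry. This is a minimal Python sketch, not Spark's actual FunctionRegistry API; the names are illustrative only.

```python
# Toy sketch of alias registration (not Spark's FunctionRegistry API):
# both SQL names resolve to the same implementation, so no new
# expression class is needed for the alias.

def size(collection):
    """Length of an array (list) or map (dict)."""
    return len(collection)

registry = {}
registry["size"] = size
registry["cardinality"] = size  # alias: reuse the existing function
```

Looking up either name yields the same function object, mirroring how char and char_length share one implementation with an updated description.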
Test build #89646 has finished for PR 21031 at commit
python/pyspark/sql/functions.py (outdated)

    @since(2.4)
    def cardinality(col):
Could you also remove this? I think we just need to do it for SQL functions now. In the future, if this is requested by the community, we can reconsider it. Thanks!
I see.
Test build #89669 has finished for PR 21031 at commit
@kiszk Could you also update the PR description? LGTM
Sure, done.
retest this please
LGTM pending Jenkins
Jenkins, retest this please.
Test build #89990 has finished for PR 21031 at commit
retest this please
Test build #90017 has finished for PR 21031 at commit
retest this please
Test build #90026 has finished for PR 21031 at commit
retest this please
Test build #90046 has finished for PR 21031 at commit
retest this please
Test build #90068 has finished for PR 21031 at commit
The SparkR test failure is not related. Thanks! Merged to master.
What changes were proposed in this pull request?

The PR adds the SQL function cardinality. The behavior of the function is based on Presto's. The function returns the length of the array or map stored in the column as int, while the Presto version returns the value as BigInt (long in Spark). The discussions regarding the difference of return type are here and there.

How was this patch tested?

Added UTs
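As a rough illustration of the semantics described above — cardinality returns the number of elements of an array or map as an int — a plain-Python equivalent might look like this. It mirrors, rather than invokes, Spark; the null handling shown is an assumption for illustration only, so refer to the PR discussion for Spark's actual null semantics.

```python
# Plain-Python illustration of the described cardinality semantics.
# This mirrors, not invokes, Spark; null handling below is an assumption.

def cardinality(value):
    if value is None:
        return None  # assumption: Spark's actual null behavior may differ
    return len(value)  # length of an array (list) or map (dict), as int
```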