Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Apr 10, 2018

What changes were proposed in this pull request?

The PR adds the SQL function cardinality. The behavior of the function is based on Presto's one.

The function returns the length of the array or map stored in the column as int while the Presto version returns the value as BigInt (long in Spark). The discussions regarding the difference of return type are here and there.

How was this patch tested?

Added UTs

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89141 has finished for PR 21031 at commit 8945854.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Cardinality(child: Expression) extends UnaryExpression with ExpectsInputTypes

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89150 has finished for PR 21031 at commit d405d8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 11, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 11, 2018

Test build #89169 has finished for PR 21031 at commit d405d8a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 11, 2018

retest this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we extend Size instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better way to extend a case class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I see... maybe a common abstract class as you did in another case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing scala doc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, thanks

@SparkQA
Copy link

SparkQA commented Apr 11, 2018

Test build #89189 has finished for PR 21031 at commit d405d8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: what about adding a def resultTypeBigInt (or something similar) which is overridden by the subclasses instead of having it as a argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to elaborate on the advantage of adding def resultTypeBigInt instead of passing as an argument?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this way you can directly write the eval and doGenCode methods and in the Size and Cardinality classes we just need to override def resultTypeBigInt setting it to true or false. I think it is cleaner, but it is not a big deal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, I realized dataType can be used for this purpose. Thank you for your comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even better!

@SparkQA
Copy link

SparkQA commented Apr 11, 2018

Test build #89206 has finished for PR 21031 at commit 67f8716.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class SizeUtil extends UnaryExpression with ExpectsInputTypes
  • case class Size(child: Expression) extends SizeUtil
  • case class Cardinality(child: Expression) extends SizeUtil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BigInt is Presto's data type name, I think we should use SparkSQL's data type here. Btw, I think we should use LongType instead of DecimalType.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update #21037, too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is quite pointless to have this and def doGenCode here and in Size. Can't we just implement eval and doGenCode in SizeUtil? ie. rename there doSizeGenCode -> doGenCode and sizeEval -> eval

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89254 has finished for PR 21031 at commit b4f9839.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89258 has finished for PR 21031 at commit 11fe6f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 12, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89269 has finished for PR 21031 at commit 11fe6f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 12, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 12, 2018

Test build #89280 has finished for PR 21031 at commit 11fe6f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BigInt -> long

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks

@mgaido91
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Apr 13, 2018

Test build #89332 has finished for PR 21031 at commit a21f85b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Apr 13, 2018

If there is already size, why do we need to create a new implementation? Why can't we just rewrite cardinality to size?

Also I wouldn't add any programming API for this, since there is already size.

@kiszk
Copy link
Member Author

kiszk commented Apr 14, 2018

According to my understanding, these activities are to improve compatibility with other DBs (like Presto) in https://issues.apache.org/jira/browse/SPARK-23899 and https://issues.apache.org/jira/browse/SPARK-23923.

As you pointed out, cardinality and size has the same except data type. I used the same implementation.

@gatorsmile what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not add the function APIs. Adding the SQL function is enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expression[Size]("cardinality"), Does this work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Presto, cardinality's return type is BigInt. Thus, cardinality in Spark uses uses long as return type. If we use int as cardinality's return type, I think that it works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be fine in most cases. If we really need it In the future, we can add new APIs with better names later?

Like in SQL, we have COUNT and COUNT_BIG. COUNT_BIG is similar to COUNT but the result can be grater than the max value of integer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I will update this PR with only expression[Cardinality]("cardinality"). I will also update the description and JIRA later.

Is it OK with you? @gatorsmile cc: @ueshin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: Do we need @ExpressionDescription? If we need this, we may need to create Cardinality case class that is the same as Size. Of course, Cardinality and Size can extend the same parent class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alias names are implemented like that. No need to create a new one. We did it in the other SQL functions too. like char, char_length and many. We can update the description in the existing ExpressionDescription

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your clarification. I see.

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89646 has finished for PR 21031 at commit c6e8d81.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes



@since(2.4)
def cardinality(col):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also remove this? I think we just need to do it for SQL functions now. In the future, if this is requested by the community, we can reconsider it. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@SparkQA
Copy link

SparkQA commented Apr 21, 2018

Test build #89669 has finished for PR 21031 at commit dd46bbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@kiszk Could you also update the PR description? LGTM

@kiszk
Copy link
Member Author

kiszk commented Apr 22, 2018

Sure, done.

@gatorsmile
Copy link
Member

retest this please

@gatorsmile
Copy link
Member

LGTM pending Jenkins

@ueshin
Copy link
Member

ueshin commented May 1, 2018

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented May 1, 2018

Test build #89990 has finished for PR 21031 at commit dd46bbf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90017 has finished for PR 21031 at commit dd46bbf.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented May 2, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90026 has finished for PR 21031 at commit dd46bbf.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented May 2, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90046 has finished for PR 21031 at commit dd46bbf.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented May 2, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2018

Test build #90068 has finished for PR 21031 at commit dd46bbf.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

The SparkR test failure is not related. Thanks! Merged to master.

@asfgit asfgit closed this in 5be8aab May 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants