[SPARK-23923][SQL] Add cardinality function #21031
Conversation
Test build #89141 has finished for PR 21031 at commit
Test build #89150 has finished for PR 21031 at commit
retest this please
Test build #89169 has finished for PR 21031 at commit
retest this please
can't we extend Size instead?
Is it better to extend a case class?
oh, I see... maybe a common abstract class as you did in another case?
missing scala doc
good catch, thanks
Test build #89189 has finished for PR 21031 at commit
minor: what about adding a def resultTypeBigInt (or something similar) which is overridden by the subclasses, instead of having it as an argument?
Would it be possible to elaborate on the advantage of adding def resultTypeBigInt instead of passing it as an argument?
In this way you can write the eval and doGenCode methods directly, and in the Size and Cardinality classes we just need to override def resultTypeBigInt, setting it to true or false. I think it is cleaner, but it is not a big deal.
Now, I realized dataType can be used for this purpose. Thank you for your comment.
even better!
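The refactoring the thread converges on above — a common abstract base class whose declared dataType drives the shared evaluation logic, instead of a boolean constructor argument — can be sketched roughly as follows. This is a hypothetical, plain-Python illustration of the design, not Spark's actual Expression API; all names here are illustrative.

```python
# Hypothetical sketch (plain Python, not Spark's Expression API) of the
# design discussed above: a common abstract base class implements the
# shared length logic once, and each subclass only declares its data type.

class CollectionLength:
    """Shared base: length of an array (list) or map (dict)."""
    data_type = "int"  # overridden by subclasses; the only per-class difference

    def eval(self, value):
        # One shared implementation; no per-subclass eval/doGenCode needed.
        return len(value)

class Size(CollectionLength):
    data_type = "int"      # size returns int

class Cardinality(CollectionLength):
    data_type = "bigint"   # Presto's cardinality returns BigInt (long in Spark)
```

With dataType as the only override point, the evaluation and code-generation logic can live entirely in the shared base, which is what the discussion above settles on.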
Test build #89206 has finished for PR 21031 at commit
BigInt is Presto's data type name; I think we should use Spark SQL's data type here. Btw, I think we should use LongType instead of DecimalType.
Good catch, thanks.
I will update #21037, too
It is quite pointless to have this and def doGenCode both here and in Size. Can't we just implement eval and doGenCode in SizeUtil? i.e. rename doSizeGenCode -> doGenCode and sizeEval -> eval there.
Good catch, thanks
Test build #89254 has finished for PR 21031 at commit
Test build #89258 has finished for PR 21031 at commit
retest this please
Test build #89269 has finished for PR 21031 at commit
retest this please
Test build #89280 has finished for PR 21031 at commit
BigInt -> long
Good catch, thanks
LGTM
Test build #89332 has finished for PR 21031 at commit
If there is already size, why do we need to create a new implementation? Why can't we just rewrite cardinality to size? Also I wouldn't add any programming API for this, since there is already size.
According to my understanding, these activities are to improve compatibility with other DBs (like Presto) in https://issues.apache.org/jira/browse/SPARK-23899 and https://issues.apache.org/jira/browse/SPARK-23923. As you pointed out, @gatorsmile what do you think?
Do not add the function APIs. Adding the SQL function is enough.
I see.
Does expression[Size]("cardinality") work?
In Presto, cardinality's return type is BigInt. Thus, cardinality in Spark uses long as the return type. If we use int as cardinality's return type, I think that it works.
ping @gatorsmile
That might be fine in most cases. If we really need it in the future, we can add new APIs with better names later.
Like in SQL, we have COUNT and COUNT_BIG. COUNT_BIG is similar to COUNT, but the result can be greater than the max value of integer.
I see. I will update this PR with only expression[Cardinality]("cardinality"). I will also update the description and JIRA later.
Is it OK with you? @gatorsmile cc: @ueshin
One question: do we need @ExpressionDescription? If we need this, we may need to create a Cardinality case class that is the same as Size. Of course, Cardinality and Size can extend the same parent class.
Alias names are implemented like that; no need to create a new one. We did the same for other SQL functions, like char, char_length, and many more. We can update the description in the existing ExpressionDescription.
Thank you for your clarification. I see.
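The alias approach described above — registering a second SQL name onto the existing implementation rather than adding a new expression class — can be illustrated with a toy registry. This is a minimal Python sketch, not Spark's actual FunctionRegistry API; the names are illustrative only.

```python
# Toy sketch of alias registration (not Spark's FunctionRegistry API):
# both SQL names resolve to the same implementation, so no new
# expression class is needed for the alias.

def size(collection):
    """Length of an array (list) or map (dict)."""
    return len(collection)

registry = {}
registry["size"] = size
registry["cardinality"] = size  # alias: reuse the existing function
```

Looking up either name yields the same function object, mirroring how char and char_length share one implementation with an updated description.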
Test build #89646 has finished for PR 21031 at commit
python/pyspark/sql/functions.py (outdated)

    @since(2.4)
    def cardinality(col):
Could you also remove this? I think we just need to do it for SQL functions now. In the future, if this is requested by the community, we can reconsider it. Thanks!
I see.
Test build #89669 has finished for PR 21031 at commit
@kiszk Could you also update the PR description? LGTM
Sure, done.
retest this please
LGTM pending Jenkins
Jenkins, retest this please.
Test build #89990 has finished for PR 21031 at commit
retest this please
Test build #90017 has finished for PR 21031 at commit
retest this please
Test build #90026 has finished for PR 21031 at commit
retest this please
Test build #90046 has finished for PR 21031 at commit
retest this please
Test build #90068 has finished for PR 21031 at commit
The SparkR test failure is not related. Thanks! Merged to master.
What changes were proposed in this pull request?

The PR adds the SQL function cardinality. The behavior of the function is based on Presto's. The function returns the length of the array or map stored in the column as int, while the Presto version returns the value as BigInt (long in Spark). The discussions regarding the difference of return type are here and there.

How was this patch tested?

Added UTs
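As a rough illustration of the semantics described above — cardinality returns the number of elements of an array or map as an int — a plain-Python equivalent might look like this. It mirrors, rather than invokes, Spark; the null handling shown is an assumption for illustration only, so refer to the PR discussion for Spark's actual null semantics.

```python
# Plain-Python illustration of the described cardinality semantics.
# This mirrors, not invokes, Spark; null handling below is an assumption.

def cardinality(value):
    if value is None:
        return None  # assumption: Spark's actual null behavior may differ
    return len(value)  # length of an array (list) or map (dict), as int
```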