-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22239][SQL][Python] Enable grouped aggregate pandas UDFs as window functions with unbounded window frames #21082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6a964d4 to
9fdcfe6
Compare
|
Test build #89416 has finished for PR 21082 at commit
|
|
Test build #89462 has finished for PR 21082 at commit
|
|
Test build #89584 has finished for PR 21082 at commit
|
|
Test build #89586 has finished for PR 21082 at commit
|
|
cc @BryanCutler @ueshin @HyukjinKwon @viirya cc @yhuai because of window related changes. This PR is ready for review now |
|
Test build #89595 has finished for PR 21082 at commit
|
|
Test build #89597 has finished for PR 21082 at commit
|
|
retest this please |
|
From a very quick look, the flakiness looks global. |
|
Test build #89612 has finished for PR 21082 at commit
|
|
retest this please. |
|
Test build #89616 has finished for PR 21082 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because an early analysis exception is thrown by rule ExtractWindowExpressions
python/pyspark/sql/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unrelated, but I figured its shouldn't hurt to add an array test in GroupedAggPandasUDFTests..
|
Test build #89682 has finished for PR 21082 at commit
|
|
Test build #89693 has finished for PR 21082 at commit
|
|
Hey @HyukjinKwon @ueshin @BryanCutler I've fixed the tests and I think the PR is in good shape for review now. Could you please take a look when you have time? Thanks! |
python/pyspark/sql/functions.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:class:`pyspark.sql.Window`?
python/pyspark/sql/functions.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: shows.
|
Will take a close look soon within this weekend as well. |
python/pyspark/sql/functions.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we don't have PandasUDFType.WINDOW_AGG and a pandas udf defined as PandasUDFType.GROUPED_AGG can be both used with groupby and Window?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly. The idea is that the producer of the UDF can produce a grouped agg udf, such as weighted mean, and the consumer can use the UDF in both groupby and window, similar to how SQL aggregation function work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this analysis check in Analyzer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to do this in Analyzer, then we would carry the WindowFunctionType in the logical plan.
I did it this way to avoid changing the logical node. I am open to add WindowFunctionType to the logical plan though. What do other people think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BTW:
.reduceLeft {
...
}
)
python/pyspark/sql/functions.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@icexelloss, actually should we keep this note? I think this is matched with https://spark.apache.org/docs/latest/sql-programming-guide.html#supported-sql-types which we documented there and SQLConf.
Probably, just leaving a link could be fine. Removing out is okay to me too. I think just adding a note for all the Pandas udfs works too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning towards keeping this in the API doc and maybe make sql-programming-guide link to this.
I think most user would look for API docs first rather than sql-programming-guide, so it's probably a bit more convenient to have it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I think that works too. I left a comment only because it looked mismatched with this api doc and the sql programming guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would do
else {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: inlined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p{ -> p {
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
but just for clarification, @icexelloss, do you have a WIP work or plan to support bounded ones too?
|
Test build #91499 has finished for PR 21082 at commit
|
|
@HyukjinKwon Thanks for the review! I will address the comments shortly. And yes, I will work on bounded windows on top of this PR. |
|
Thanks @HyukjinKwon for your review! @ueshin Do you want to take another look too? |
|
Yea, let's leave this open for few more days in case someone has more comments. |
|
Test build #91574 has finished for PR 21082 at commit
|
|
LGTM. |
|
@icexelloss, mind resolving the conflict? |
…sion after optmizition stage
6350408 to
328b2c4
Compare
|
Test build #91718 has finished for PR 21082 at commit
|
|
Merged to master. |
|
Thanks everyone for the review! |
|
Hello! this is great work! Thank you for contributing. This code will enable to run functions on window, which take in pd.Series -> Any. I am wondering if GROUPED_MAP pandas UDF as window functions is also in pipeline or not? This could simplified to: |
What changes were proposed in this pull request?
This PR enables using a grouped aggregate pandas UDFs as window functions. The semantics is the same as using SQL aggregation function as window functions.
The scope of this PR is somewhat limited in terms of:
(1) Only supports unbounded window, which acts essentially as group by.
(2) Only supports aggregation functions, not "transform" like window functions (n -> n mapping)
Both of these are left as future work. Especially, (1) needs careful thinking w.r.t. how to pass rolling window data to python efficiently. (2) is a bit easier but does require more changes therefore I think it's better to leave it as a separate PR.
How was this patch tested?
WindowPandasUDFTests