[SPARK-23011][SQL][PYTHON] Support alternative function form with group aggregate pandas UDF #20295
Conversation
|
Test build #86286 has finished for PR 20295 at commit
|
|
cc @ueshin @HyukjinKwon @cloud-fan @viirya This PR implements the discussion here: #20211 (review). There is more refinement to be done, but I'd like to get some early feedback on whether this approach looks good in general. The general idea is to pass the grouping columns as extra columns to the python worker and to use them. I can also implement some kind of de-duplication logic. WDYT? |
|
Test build #86290 has finished for PR 20295 at commit
|
|
How are you going to send the group columns? For a group we have only one group row and a bunch of data rows. |
|
@cloud-fan Currently I send the group columns as extra data columns alongside the regular data columns. For example, if the original DataFrame has ... I implemented it this way because it doesn't change the existing serialization protocol. Alternatively, we could implement a new serialization protocol for the GROUP_MAP eval type, i.e., instead of sending just an Arrow batch, we could send a group row and then an Arrow batch. What do you think? |
|
How do we turn a single group column to a series? just repeat the group column? |
|
Yep, that's correct. |
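A minimal pandas sketch of the repetition being discussed (illustration only; the key value, data, and column names are made up, and this is not the actual worker code):

import pandas as pd

group_key = 1                                 # single value of the grouping column for this group
data = pd.DataFrame({'v': [1.0, 2.0, 3.0]})   # data rows belonging to the group

# Repeat the key so it lines up with the data rows, then ship it alongside them.
key_series = pd.Series([group_key] * len(data), name='id')
batch = pd.concat([key_series, data], axis=1)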
|
To me, this seems roughly fine. I don't have a strong preference on this. |
I'm basically neutral too, but I'd prefer the new serialization if there is a simple way to do it that is performant enough.
python/pyspark/serializers.py
Maybe we can remove the comment above (# NOTE: ...) ?
I actually don't know what the comment above means. @BryanCutler do you remember?
Yeah, the note was because previously we iterated over Arrow columns and converted each to a Series, and then we changed to converting an Arrow batch to a DataFrame and iterating over the DataFrame columns to get each Series. I wasn't sure if there might be a perf decrease, so I left the note, but I'm not sure why it wasn't done like the above in the first place - it seems like it would be just as good as the original. Anyway, yeah, the note can be removed now.
@BryanCutler Thanks for the clarification. I removed the note.
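For context, a rough sketch of the two conversion paths mentioned in the note above (a hedged illustration assuming pyarrow; not the actual serializer code):

import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 1, 2]), pa.array([1.0, 2.0, 3.0])], ['id', 'v'])

# Previous approach: convert each Arrow column to a pandas Series directly.
series_old = [batch.column(i).to_pandas() for i in range(batch.num_columns)]

# Current approach: convert the whole batch to a pandas DataFrame, then take its columns.
pdf = batch.to_pandas()
series_new = [pdf[name] for name in pdf.columns]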
|
Let me experiment with the new serialization approach. I will update here. |
BryanCutler left a comment:
This sounds fine to me; I think the different serialization is slightly better to avoid duplicating data. Would the group Row be sent as a separate Arrow batch?
Regarding the API, I missed the original discussion, but just as an additional thought: a while back I proposed having an optional kwargs argument for each pandas_udf to deal with 0-param UDFs. If we were to do that, the group Row could be placed in there, and then there wouldn't need to be two types of signatures to allow for an optional key arg. I can see why it might be preferable to have an explicit key though, so it's up to you guys - just thought I'd mention this again.
Force-pushed from c7fccde to 259edc5
|
Hi all, I did some digging and I think adding a serialization form that serializes a key object along with an Arrow record batch is quite complicated, because we are using ArrowStreamReader/Writer for sending batches, and sending extra key data would require using a lower-level Arrow API for sending/receiving batches. I did two things to convince myself the current approach is fine:
We will not send extra grouping columns, because those are already part of the data columns. Instead, we will just use the corresponding data column to get the grouping key to pass to the user function. However, if the user calls ... then an extra column ...
I'd like to leave more flexible Arrow serialization as future work, because it doesn't seem to affect the performance of this patch, and to proceed with the current patch based on the two points above. What do you guys think? |
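The elided case above appears to concern grouping by an expression rather than an existing column; a hedged sketch of the two cases (assuming the grouped map pandas UDF API; `spark` is an existing SparkSession and the DataFrame and UDF are made up):

from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

# Grouping by an existing data column: no extra column is needed; the key can be
# read back from the `id` column that is already part of the data.
df.groupby('id').apply(normalize)

# Grouping by an expression: the expression result is not one of the data columns,
# so an extra column carrying its value would accompany the data.
df.groupby(df.id % 2).apply(normalize)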
|
Test build #86606 has finished for PR 20295 at commit
|
|
Test build #86604 has finished for PR 20295 at commit
|
|
Test build #86651 has finished for PR 20295 at commit
|
python/pyspark/sql/udf.py
We should update the error message here.
Aha good catch. Fixed.
|
For #20295 (comment), I am actually fine without a new serialization protocol. I didn't have a strong preference there because I wasn't sure whether it's worth it - complexity vs. actual gain - and it seems that's now clarified. I am okay with the current approach. @BryanCutler, I think the intention here is to follow a few other APIs and |
|
@HyukjinKwon Thanks for the comment. I will continue with the current approach unless objections are raised. I will work on the comments and refinements in the next day or two. |
|
@HyukjinKwon @ueshin This is ready for review. I addressed the comments so far. @BryanCutler yeah, I think kwargs is another option, but I think the API in this PR is more consistent with the existing APIs. |
|
Test build #86773 has finished for PR 20295 at commit
|
Force-pushed from edb77dc to 2668251
|
Rebased |
|
Test build #86834 has finished for PR 20295 at commit
|
|
Test build #86836 has finished for PR 20295 at commit
|
|
Test build #86837 has finished for PR 20295 at commit
|
Yeah, if it's consistent with other APIs then it sounds fine to me. My concern was with giving the user so many options that it starts to get confusing to make UDFs. If it's a familiar API then that probably won't be the case. |
Force-pushed from 9ed3779 to 722ed50
|
Addressed all comments and manually tested the example in the docstring. |
|
Test build #87968 has finished for PR 20295 at commit
|
|
Will merge this one if there are no more comments or it hasn't been merged within a few days. |
python/pyspark/worker.py
def wrapped(*series):
def wrapped(key_series, value_series):
    import pandas as pd
    argspec = inspect.getargspec(f)
Should this also use getfullargspec for py3, like in udf.py?
Maybe it would be useful to put a function in util.py - what do you guys think?
Good point. Let me take a look at that.
|
LGTM except for @BryanCutler's suggestion (#20295 (comment)). Thanks! |
|
@icexelloss Could you annotate |
|
Test build #88020 has finished for PR 20295 at commit
|
|
retest this please |
python/pyspark/sql/udf.py
sc.pythonVer, broadcast_vars, sc._javaAccumulator)

def _get_argspec(f):
How about putting this in pyspark.util? It might be useful in places other than sql
Makes sense. Moved to pyspark.util.
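For reference, a rough sketch of what such a Python 2/3-compatible helper could look like (not necessarily the exact code that was merged):

import inspect
import sys

def _get_argspec(f):
    # Python 2 has no getfullargspec; on Python 3, getargspec is deprecated.
    if sys.version_info[0] < 3:
        return inspect.getargspec(f)
    return inspect.getfullargspec(f)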
|
Test build #88025 has finished for PR 20295 at commit
|
|
Test build #88049 has finished for PR 20295 at commit
|
|
Test build #88050 has finished for PR 20295 at commit
|
|
Test build #88048 has finished for PR 20295 at commit
|
|
Merged to master. |
|
BTW, let's not forget to fix the doc later.
|
Thanks all for the review! @HyukjinKwon do you mean this doc? I can update it now, or we can update it later in a batch before the 2.4 release. Which do you prefer? |
|
Yup. Maybe we could do that when we are close to 2.4. |
|
Sounds good. Let's track it in https://issues.apache.org/jira/browse/SPARK-23633
self.assertPandasEqual(expected2, result2)

# Test complex groupby
result3 = df.groupby(df.id, df.v % 2).apply(udf2).sort('id', 'v').toPandas()
Is there any negative test case for when the number of columns specified in groupby is different from the definition of the UDF (foo2)?
For end users, misuse of this alternative function form could be common. For example, do we issue an appropriate error in the following cases?
- result3 = df.groupby(df.id).apply(udf2).sort('id', 'v').toPandas()
- result3 = df.groupby(df.id, df.v % 2, df.id).apply(udf2).sort('id', 'v').toPandas()
In that case, the error will be thrown as-is from the worker.py side, which is read and redirected to the user's end via the JVM. For instance:
from pyspark.sql.functions import pandas_udf, PandasUDFType

def test_func(key, pdf):
    assert len(key) == 0
    return pdf

udf1 = pandas_udf(test_func, "id long, v1 double", PandasUDFType.GROUPED_MAP)
spark.range(10).groupby('id').apply(udf1).sort('id').show()

18/09/04 14:22:52 ERROR TaskSetManager: Task 1 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 378, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 353, in main
process()
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 348, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 242, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 110, in wrapped
result = f(key, pd.concat(value_series, axis=1))
File "/.../spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 2, in test_func
AssertionError
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:418)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:372)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1427)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1424)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:48)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:128)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1822)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1810)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1809)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1809)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2043)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1992)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1981)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1029)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1433)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1420)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 353, in main
process()
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 348, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 242, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 110, in wrapped
result = f(key, pd.concat(value_series, axis=1))
File "/.../spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 2, in test_func
AssertionError
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:418)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:372)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1427)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1424)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:48)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:128)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
What changes were proposed in this pull request?
This PR proposes to support an alternative function form with group aggregate pandas UDF.
The current form takes a single argument that is a pandas DataFrame.
With this PR, an alternative form is supported: it takes two arguments - a tuple that represents the grouping key, and a pandas DataFrame that represents the data. Both forms are sketched below.
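A sketch of both forms, based on the grouped map pandas UDF API exercised by the tests (`spark` is an existing SparkSession and the data is illustrative):

from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 4.0)], ("id", "v"))

# Current form: a single argument that is a pandas DataFrame.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

# Alternative form: (key, pdf), where `key` is a tuple holding the grouping key
# values for the group, e.g. (1,) when grouping by `id`.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean_with_key(key, pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
df.groupby("id").apply(subtract_mean_with_key).show()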
How was this patch tested?
GroupbyApplyTests