Conversation

@icexelloss
Contributor

What changes were proposed in this pull request?

This PR proposes to support an alternative function form with group aggregate pandas UDF.

The current form:

def foo(pdf):
    return ...

Takes a single argument that is a pandas DataFrame.

With this PR, an alternative form is supported:

def foo(key, pdf):
    return ...

The alternative form takes two arguments: a tuple that represents the grouping key, and a pandas DataFrame that represents the data.
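
For illustration, a minimal usage sketch of the two-argument form (the DataFrame, column names, and UDF body below are hypothetical):

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Hypothetical grouped map UDF using the alternative two-argument form.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_key(key, pdf):
    # key is a tuple of the grouping values, e.g. (1,) for groupby('id')
    return pdf.assign(v=pdf.v - key[0])

# df is assumed to have columns 'id' and 'v'.
df.groupby('id').apply(subtract_key).show()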

How was this patch tested?

GroupbyApplyTests

@icexelloss icexelloss changed the title [SPARK-23011] Support alternative function form with group aggregate pandas UDF wip: [SPARK-23011] Support alternative function form with group aggregate pandas UDF Jan 17, 2018
@SparkQA

SparkQA commented Jan 17, 2018

Test build #86286 has finished for PR 20295 at commit 38195ac.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss icexelloss changed the title wip: [SPARK-23011] Support alternative function form with group aggregate pandas UDF [WIP][SPARK-23011] Support alternative function form with group aggregate pandas UDF Jan 17, 2018
@icexelloss
Contributor Author

icexelloss commented Jan 17, 2018

cc @ueshin @HyukjinKwon @cloud-fan @viirya

This PR implements the discussion here: #20211 (review). There is more refinement to be done, but I'd like to get some early feedback on whether this approach looks good in general.

The general idea is to pass the grouping columns as extra columns to the python worker and use argOffsets to specify which columns are the grouping columns. Finally, we convert the grouping columns to a single tuple before entering the user function. This is slightly inefficient because the grouping columns are sent twice, but I think this is OK because the grouping columns should be relatively small compared to the entire DataFrame.

I could also implement some kind of deduplication logic in FlatMapGroupsInPandasExec; however, that would require creating another UnsafeProjection, and I am not sure whether it's worth it performance-wise.

WDYT?

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86290 has finished for PR 20295 at commit 7ce3fa7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

How are you going to send the group columns? For a group we have only one group row and a bunch of data rows.

@icexelloss
Contributor Author

icexelloss commented Jan 18, 2018

@cloud-fan Currently I send the group columns as extra columns along with the data columns. For example, if the original DataFrame has columns id, v and the group column is id, the current implementation in this PR will send three series id, id, v to the python worker, along with argOffsets of [1, 2] to specify that the data columns are id, v. The first value of the group column is used as the group key, because all values in a group column are equal within a group.
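
As a rough illustration (not the actual worker code; the helper and names below are made up), key extraction on the worker side under this scheme could look like:

def extract_key_and_data(series, key_offsets, data_offsets):
    # All values within a grouping column are equal for a given group,
    # so the first element of each grouping series can serve as the key.
    key = tuple(series[o].iloc[0] for o in key_offsets)
    data = [series[o] for o in data_offsets]
    return key, data

# For the id, id, v example above: key_offsets=[0], data_offsets=[1, 2].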

I implemented it this way because it doesn't change the existing serialization protocol. Alternatively, we could implement a new serialization protocol for the GROUP_MAP eval type, i.e., instead of sending an Arrow batch, we could send a group row and then an Arrow batch. What do you think?

@cloud-fan
Contributor

How do we turn a single group column into a series? Just repeat the group value?

@icexelloss
Contributor Author

Yep, that's correct.

@HyukjinKwon
Member

To me, seems roughly fine.

Alternatively, we could implement a new serialization protocol for the GROUP_MAP eval type, i.e., instead of sending an Arrow batch, we could send a group row and then an Arrow batch.

I don't have a strong preference on this.

Member

@ueshin ueshin left a comment

I'm basically neutral too, but I'd prefer the new serialization if there is a simple way to do it that is performant enough.

Member

Maybe we can remove the comment above (# NOTE: ...) ?

Contributor Author

I actually don't know what the comment above means. @BryanCutler do you remember?

Member

Yeah, the note was because previously we iterated over the Arrow columns and converted each to a Series, then we changed to converting the Arrow batch to a DataFrame and iterating over the DataFrame columns to get each Series. I wasn't sure if there might be a performance decrease, so I left the note, but I'm not sure why it wasn't done like the above in the first place - it seems like it would be just as good as the original. Anyway, yeah, the note can be removed now.
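
For context, a rough sketch of the two conversion paths being compared (pyarrow-based and purely illustrative):

import pyarrow as pa

def columns_to_series(batch: pa.RecordBatch):
    # Earlier approach: convert each Arrow column directly to a pandas Series.
    return [col.to_pandas() for col in batch.columns]

def batch_to_series(batch: pa.RecordBatch):
    # Current approach: convert the whole Arrow batch to a pandas DataFrame,
    # then take each column as a Series.
    pdf = batch.to_pandas()
    return [pdf[name] for name in pdf.columns]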

Contributor Author

@BryanCutler Thanks for the clarification. I removed the note.

@icexelloss
Contributor Author

Let me experiment with the new serialization approach. Will update here.

Member

@BryanCutler BryanCutler left a comment

This sounds fine to me. I think the different serialization is slightly better to avoid duplicate data. Would the group Row be sent as a separate Arrow batch?

Regarding the API, I missed the original discussion, but just as an additional thought: a while back I proposed having an optional kwargs for each pandas_udf to deal with 0-param udfs. If we were to do that, the group Row could be placed in there, and then there wouldn't need to be two types of signatures to allow for an optional key arg. I can see why it might be preferable to have an explicit key though, so it's up to you guys - just thought I'd mention this again.

@icexelloss icexelloss force-pushed the SPARK-23011-groupby-apply-key branch from c7fccde to 259edc5 Compare January 24, 2018 22:15
@icexelloss
Contributor Author

Hi all,

I did some digging, and I think adding a serialization form that serializes a key object along with an Arrow record batch is quite complicated, because we are using ArrowStreamReader/Writer for sending batches, and sending extra key data would require using a lower-level Arrow API for sending/receiving batches.

I did two things to convince myself the current approach is fine:

  • I added logic to deduplicate grouping keys when they are already in the data columns, i.e., if a user calls
df.groupby('id').apply(foo_udf)

We will not send extra grouping columns because those are already part of the data columns. Instead, we will just use the corresponding data column to get the grouping key to pass to the user function. However, if the user calls:

df.groupby(df.id % 2).apply(foo_udf)

then an extra column df.id % 2 will be created and sent to the python worker. But I think this is an uncommon case.

  • I did some benchmarking to see the impact of sending an extra grouping column. I used a Spark DataFrame with a single column to maximize the effect of the extra grouping column (sending the extra grouping column basically doubles the data sent to python in this benchmark; in real use cases the effect of sending extra grouping columns should be far less).
    Even with this benchmark setup, I have not observed a performance difference when sending extra grouping columns. I think this is because the total time is dominated by other parts of the computation. micro benchmark

I'd like to leave more flexible Arrow serialization as future work, since it doesn't seem to affect the performance of this patch, and proceed with the current patch based on the two points above. What do you guys think?

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86606 has finished for PR 20295 at commit cda97f1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86604 has finished for PR 20295 at commit c7fccde.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86651 has finished for PR 20295 at commit 6b8fdbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

We should update the error message here.

Contributor Author

Aha good catch. Fixed.

@HyukjinKwon
Member

HyukjinKwon commented Jan 29, 2018

For #20295 (comment), I am fine without a new serialization protocol, actually. I didn't have a strong preference there because I wasn't sure whether it was worth it - complexity vs. actual gain - and that now seems clarified there. I am okay with the current approach.

@BryanCutler, I think the intention here is to follow a few other APIs and gapply in R. I guess you meant the length and metadata stuff by "an optional kwargs to each pandas_udf to deal with 0-param udfs", if I remember correctly. I think that's slightly different, because here the motivation is to provide consistent support similar to other APIs, whereas the kwargs idea sounds like a pretty new concept to me.

@icexelloss
Contributor Author

@HyukjinKwon Thanks for the comment. I will continue with the current approach unless objections are raised. I will work on the comments and refinements in the next day or two.

@icexelloss icexelloss changed the title [WIP][SPARK-23011] Support alternative function form with group aggregate pandas UDF [SPARK-23011] Support alternative function form with group aggregate pandas UDF Jan 29, 2018
@icexelloss
Contributor Author

@HyukjinKwon @ueshin This is ready for review. I addressed the comments so far.

@BryanCutler yeah, I think kwargs is another option, but I think the API in this PR is more consistent with the existing APIs.

@SparkQA

SparkQA commented Jan 29, 2018

Test build #86773 has finished for PR 20295 at commit edb77dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss icexelloss force-pushed the SPARK-23011-groupby-apply-key branch from edb77dc to 2668251 Compare January 30, 2018 15:35
@icexelloss
Contributor Author

Rebased

@SparkQA

SparkQA commented Jan 30, 2018

Test build #86834 has finished for PR 20295 at commit 2668251.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2018

Test build #86836 has finished for PR 20295 at commit 2399b77.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2018

Test build #86837 has finished for PR 20295 at commit 8f0782c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

@BryanCutler yeah, I think kwargs is another option, but I think the API in this PR is more consistent with the existing APIs.

Yeah, if it's consistent with other APIs then it sounds fine to me. My concern was giving the user so many options that it starts to get confusing to write UDFs. If it's a familiar API then that probably won't be the case.

@icexelloss icexelloss force-pushed the SPARK-23011-groupby-apply-key branch from 9ed3779 to 722ed50 Compare March 5, 2018 16:39
@icexelloss
Contributor Author

Addressed all comments and manually tested the example in the docstring.

@SparkQA

SparkQA commented Mar 5, 2018

Test build #87968 has finished for PR 20295 at commit 722ed50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Will merge this one if there are no more comments in the next few days.

def wrapped(*series):
def wrapped(key_series, value_series):
import pandas as pd
argspec = inspect.getargspec(f)
Member

Should this also do getfullargspec for py3, like in udf.py?
Maybe it would be useful to put a function in util.py - what do you guys think?

Contributor Author

Good point. Let me take a look at that.

@ueshin
Member

ueshin commented Mar 6, 2018

LGTM except for @BryanCutler's suggestion (#20295 (comment)). Thanks!

@ueshin
Member

ueshin commented Mar 6, 2018

@icexelloss Could you add [SQL][PYTHON] to the PR title please?

@icexelloss icexelloss changed the title [SPARK-23011] Support alternative function form with group aggregate pandas UDF [SPARK-23011][SQL][PYTHON] Support alternative function form with group aggregate pandas UDF Mar 6, 2018
@SparkQA

SparkQA commented Mar 6, 2018

Test build #88020 has finished for PR 20295 at commit c74ed05.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss
Contributor Author

retest this please

sc.pythonVer, broadcast_vars, sc._javaAccumulator)


def _get_argspec(f):
Member

How about putting this in pyspark.util? It might be useful in places other than sql

Contributor Author

Makes sense. Moved to pyspark.util.
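
For reference, a minimal sketch of what such a helper could look like (illustrative only; not necessarily the exact code that landed in pyspark.util):

import inspect
import sys

def _get_argspec(f):
    # Python 2 only has getargspec; Python 3 provides getfullargspec,
    # which also covers keyword-only arguments and annotations.
    if sys.version_info[0] < 3:
        return inspect.getargspec(f)
    return inspect.getfullargspec(f)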

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88025 has finished for PR 20295 at commit c74ed05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88049 has finished for PR 20295 at commit d51bc2e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88050 has finished for PR 20295 at commit 4b61f52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2018

Test build #88048 has finished for PR 20295 at commit 59bdf20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

BTW, let's not forget to fix the doc later.

@asfgit asfgit closed this in 2cb23a8 Mar 8, 2018
@icexelloss
Contributor Author

Thanks all for the review! @HyukjinKwon do you mean this doc?
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pyspark-usage-guide-for-pandas-with-apache-arrow

I can update it now, or we can update it later in a batch before the 2.4 release. Which do you prefer?

@HyukjinKwon
Member

Yup. Maybe we could do that when we are close to 2.4.

@icexelloss
Contributor Author

Sounds good. Let's track in https://issues.apache.org/jira/browse/SPARK-23633

self.assertPandasEqual(expected2, result2)

# Test complex groupby
result3 = df.groupby(df.id, df.v % 2).apply(udf2).sort('id', 'v').toPandas()
Member

Any negative test case for when the number of columns specified in groupby is different from the definition of the udf (foo2)?

Member

For end users, misuse of this alternative function form could be common. For example, do we issue an appropriate error in the following cases?

  • result3 = df.groupby(df.id).apply(udf2).sort('id', 'v').toPandas()
  • result3 = df.groupby(df.id, df.v % 2, df.id).apply(udf2).sort('id', 'v').toPandas()

Member

In that case, any error will be thrown as-is from the worker.py side, which is read and redirected to the user's end via the JVM. For instance:

from pyspark.sql.functions import pandas_udf, PandasUDFType
def test_func(key, pdf):
    assert len(key) == 0
    return pdf

udf1 = pandas_udf(test_func, "id long, v1 double", PandasUDFType.GROUPED_MAP)
spark.range(10).groupby('id').apply(udf1).sort('id').show()
18/09/04 14:22:52 ERROR TaskSetManager: Task 1 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 378, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 353, in main
    process()
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 348, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 242, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 110, in wrapped
    result = f(key, pd.concat(value_series, axis=1))
  File "/.../spark/python/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 2, in test_func
AssertionError

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:418)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:372)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
	at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
	at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
	at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1427)
	at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1424)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:48)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:128)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1822)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1810)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1809)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1809)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2043)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1992)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1981)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
	at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1029)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1011)
	at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1433)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1420)
	at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 353, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 348, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 242, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 110, in wrapped
    result = f(key, pd.concat(value_series, axis=1))
  File "/.../spark/python/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 2, in test_func
AssertionError

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:418)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:372)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
	at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
	at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
	at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1427)
	at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1424)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:48)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:128)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
