# [SPARK-16062][SPARK-15989][SQL] Fix two bugs of Python-only UDTs #13778
## Conversation
Test build #60834 has finished for PR 13778 at commit

Test build #60835 has finished for PR 13778 at commit

cc @davies

Test build #60837 has finished for PR 13778 at commit
```diff
-  private def deserializerFor(input: Expression): Expression = input.dataType match {
+  private def deserializerFor(input: Expression): Expression = {
+    deserializerFor(input, input.dataType)
+  }
```
Seems that this method is never used?
uh? It is the original deserializerFor method and is used below and above.
Oh, sorry... Confused by the split diff view...
Here's an unresolved example: https://gist.github.com/vlad17/2db8e14972344c693e8a3f03d91c9c8d

Update: looks like the above is just an issue with the

Another update: https://gist.github.com/vlad17/cfcd42f30ea2380df4fb0bfa30dda7ce unresolved

@vlad17 Thanks! I will look into that issue.
@viirya Do we need to fix this in Spark 2.0? UDTs are private APIs and the only intended use case is Vector/Matrix UDTs for MLlib, which doesn't put vectors or matrices inside an array inside a pipeline. In Spark 2.1, we probably need a formal discussion on merging UDT into Encoder, which could completely change its implementation.

@mengxr Although UDTs are private APIs, as you can see from the example, users can define their own classes and corresponding UDTs in Python, which become `PythonUserDefinedType`. The issues in this PR are not rare cases, and PySpark users will very likely hit them when using 2.0.
Test build #60916 has finished for PR 13778 at commit
```python
df = self.spark.createDataFrame(
    [(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)],
    schema=schema)
df.show()
```
DataFrame.show() adds unnecessary stringification, so this test ends up testing unnecessary stuff (in fact, it would fail if the UDT didn't have `__str__`). I would use collect() to force materialization instead.
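A minimal sketch of that suggestion, reusing the `df` built in the test above (the assertions here are illustrative, not part of the actual patch):

```python
# collect() materializes the rows without stringifying them, so it does not
# depend on the UDT having __str__, yet it still exercises the full
# serialize/deserialize round trip.
rows = df.collect()
self.assertEqual(len(rows), 10)
self.assertTrue(all(isinstance(r.val, PythonOnlyPoint) for r in rows))
```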
This test only fails when using show() as I mentioned on the JIRA SPARK-16062.
Test build #60996 has finished for PR 13778 at commit

Test build #61000 has finished for PR 13778 at commit
ping @vlad17 @davies @liancheng Anything else?

also cc @yhuai @cloud-fan
```scala
    case _ => ""
  }

  val inputDT = inputDataType.getOrElse(inputData.dataType)
```
@cloud-fan I think there is no way to easily catch python udt before MapObjects. The approach I use now is to pass a datatype (python udt's sqlType) to MapObjects.
```diff
 val loopIsNull = "MapObjects_loopIsNull" + curId.getAndIncrement()
 val loopVar = LambdaVariable(loopValue, loopIsNull, elementType)
-MapObjects(loopValue, loopIsNull, elementType, function(loopVar), inputData)
+MapObjects(loopValue, loopIsNull, elementType, function(loopVar), inputData, None)
```
It is possible that inputData is not resolved yet, so we can't just pass in the data type of inputData. That is why I still make inputDataType an Option[DataType] below.
Test build #61836 has finished for PR 13778 at commit
```scala
 * @param lambdaFunction A function that takes the `loopVar` as input, and is used as the lambda
 *   function to handle collection elements.
 * @param inputData An expression that when evaluated returns a collection object.
 * @param inputDataType The dataType of inputData.
```
Document that it's optional, and say that the default behavior is to use the resolved `.dataType` of `inputData`.

OK.

Added.
Test build #61903 has finished for PR 13778 at commit
ping @cloud-fan @vlad17 Anything else?

LGTM +1
From another point of view, is it necessary to propagate the python UDT from the python side to the jvm side? IIUC the serialization of a python UDT happens on the python side, and the jvm side can only see binary for the python data; there is nothing we can do on the java side. Correct me if I am wrong, thanks.

A python UDT on the python side only serializes the python data to the sql type defined in the UDT. The problem here happens during serialization to a row on the java side, on the already-serialized python data. I don't think we can assume the serialized python data needs no serialization on the java side.

ping @cloud-fan any more concerns?

Can you point out where we catch

Oh, I mean they should be serialized/deserialized by the pickler. So I think the jvm side doesn't just see binary for the python data; it has already been processed by the pickler. That is why we can now process these data in the encoder/decoder with its sql type.

yea, so whatever the data type is (python udt or normal sql type), at the java side there is no difference: the data is converted to the correct format by the pickler. That's why I think maybe it's possible to just pass the corresponding sql type of the python udt to the java side. My only concern is, sometimes we use the schema of the java dataframe as the schema at the python side. If we don't pass the python udt to the java side, the udt information will be lost. @viirya do you mind giving it a try? thanks!

@cloud-fan I just checked the python UDT. On the python side, we serialize the python UDT to binary. The python UDT passed to java includes that binary. Then on the python side, in the worker, we deserialize the binary back to the python UDT and use it for sql data serialization. Because of that, I think we can't just pass the sql type of the python UDT to the java side. What do you think?
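For context, here is a minimal sketch of a Python-only UDT, modeled on the `PythonOnlyUDT`/`PythonOnlyPoint` pair in `pyspark.sql.tests` (the method bodies are illustrative):

```python
from pyspark.sql.types import ArrayType, DoubleType, UserDefinedType

class PythonOnlyUDT(UserDefinedType):
    """A UDT that defines no scalaUDT(), so the JVM side sees it as a
    PythonUserDefinedType carrying the pickled Python class."""

    @classmethod
    def sqlType(cls):
        # The SQL type the Python worker serializes values to. Because it is
        # an ArrayType, the Java-side deserializer applies MapObjects to it,
        # which is the code path fixed by this PR.
        return ArrayType(DoubleType(), False)

    @classmethod
    def module(cls):
        return "__main__"

    def serialize(self, obj):
        # Runs in the Python worker: PythonOnlyPoint -> [x, y].
        return [obj.x, obj.y]

    def deserialize(self, datum):
        # Runs in the Python worker: [x, y] -> PythonOnlyPoint.
        return PythonOnlyPoint(datum[0], datum[1])

class PythonOnlyPoint:
    __UDT__ = PythonOnlyUDT()  # lets PySpark's schema inference pick up the UDT

    def __init__(self, x, y):
        self.x, self.y = x, y
```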
```scala
 * expression will apply MapObjects on it. However, the data type
 * of inputData is a Python UDT, which is not an expected array type
 * in MapObjects. In this case, we need to explicitly use the
 * Python UDT's sqlType as the data type.
```
As we have to mention python udt in MapObjects anyway, I think it makes more sense to add the python udt handling in MapObjects directly.

Do you mean the earlier commit? I remember it is the first approach I took.

But I think it exposes python udt to MapObjects?

But we expose it now too. Readers have to know about python udt to understand this code.

Hmm, ok. Let me update it.

If we could hide python udt from MapObjects entirely, it would be worth doing. But it looks like we can't, and then I think it makes more sense to expose python udt more explicitly.

Makes sense. I will update it later.
@cloud-fan Updated. Please take a look. Thanks.

Test build #62151 has finished for PR 13778 at commit
ping @cloud-fan Please see if this is ok for you now. Thanks.

ping @cloud-fan Can you review this? Thanks.

ping @liancheng @yhuai Maybe you can review this too?

ping @cloud-fan Can you check if this is good for you now? It has been a while. Thanks.

ping @cloud-fan What do you think about this? Can we merge it now? Thanks.

ping @cloud-fan Did you miss this? Or do you have other concerns? Please let me know. Thanks.

ping @cloud-fan again, this has been waiting for a while. Do you have time to look at it again? Thanks.

LGTM, merging this into master and 2.0 branch, thanks!
## What changes were proposed in this pull request?
There are two related bugs of Python-only UDTs. Because the test case of the second one needs the first fix too, I put them into one PR. If that is not appropriate, please let me know.
### First bug: When MapObjects works on Python-only UDTs
`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the sql type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type. It causes an error like:
```python
import pyspark.sql.group
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql.types import *

schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
df.show()
```

```
File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
: java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedType@f4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
...
```
### Second bug: When a Python-only UDT is the element type of ArrayType
```python
import pyspark.sql.group
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql.types import *

schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
df.show()
```
## How was this patch tested?
PySpark's sql tests.
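A condensed sketch of such a test (hypothetical method name; it assumes the `PythonOnlyPoint`/`PythonOnlyUDT` helpers and the `self.spark` fixture from `pyspark.sql.tests`):

```python
def test_array_of_python_only_udt(self):
    # Array of Python-only UDT values: exercises the second bug above.
    schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
    df = self.spark.createDataFrame(
        [(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)],
        schema=schema)
    df.show()  # failed with a MatchError before this fix
```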
Author: Liang-Chi Hsieh <[email protected]>
Closes #13778 from viirya/fix-pyudt.
(cherry picked from commit 146001a)
Signed-off-by: Davies Liu <[email protected]>