-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-11743][SQL] Add UserDefinedType support to RowEncoder #9712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #45923 has finished for PR 9712 at commit
|
|
Test build #45926 has finished for PR 9712 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question here. According to the Javadoc of Row, a user should use and only use Row for StructType field. It's ok to support Product too, but do we have a reason for this? Is it needed for the UDT stuff? Sorry I'm not familiar with UDT handling, it will be good if you can explain it in detail, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I found this problem when working on ScalaUDF. ScalaUDF will use schemaFor to obtain catalyst type for UDF input and output. The catalyst type returned by schemaFor for a Product is StructType. It is reasonable as we don't have other type to represent Product as I see.
So for a StructType field in an external Row, both Row and Product are possible values. When we call extractorsFor on the external Row, externalDataTypeFor will return ObjectType(classOf[Row]) for this field. But the get accessor on the inputObject (i.e., the Row) will possibly return a Product for the ScalaUDF case and an exception will be thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If one of the input parameter is Tuple2, then we need to use the encoder to decode a catalyst value to external value, i.e. decode an InternalRow object to Tuple2 object. I think this is hard for a RowEncoder(your change only makes it possible to encode a Product into InternalRow, but not vice versa), we should use ProductEncoder for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have an input parameter mapping to a StructType field in an InternalRow, we will use Row as its input type. E.g., sqlContext.udf.register("udfFunc", (ns: Row) => { (ns.getInt(0), ns.getString(1)) }). But we can't use Row as output type for an UDF. Because we can still get the input schema of ScalaUDF's children expressions later if we can't infer input types correctly by using schemaFor. However, the output types of the UDF can be only inferred by schemaFor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. We need the type tag to infer the return type of UDF, and if a Row is returned, there is no type information we can get. How about we use ProductEncoder or FlatEncoder for the return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, as I tried, I found a problem is we may not always be able to get the type T needed to construct ProductEncoder and FlatEncoder. Even we can get it, we can't keep it in ScalaUDF due to serialization issue. So I think using RowEncoder is more reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason we still support Product for StructType is for backward-compatibility, we did not enforce the inbound type before, someone may reply one it (because it's easier than Row in Scala).
|
Test build #45931 has finished for PR 9712 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we merge these branches together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. updated.
|
Test build #45992 has finished for PR 9712 at commit
|
|
retest this please. |
|
Test build #45994 has finished for PR 9712 at commit
|
|
LGTM, merging this into master and 1.6 branch, thanks! |
JIRA: https://issues.apache.org/jira/browse/SPARK-11743 RowEncoder doesn't support UserDefinedType now. We should add the support for it. Author: Liang-Chi Hsieh <[email protected]> Closes #9712 from viirya/rowencoder-udt. (cherry picked from commit b0c3fd3) Signed-off-by: Davies Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason why we put this test here instead of adding arrayOfUDT type in encodeDecodeTest like structOfUDT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I moved it in #10538.
A following pr for #9712. Move the test for arrayOfUDT. Author: Liang-Chi Hsieh <[email protected]> Closes #10538 from viirya/move-udt-test.
A following pr for apache#9712. Move the test for arrayOfUDT. Author: Liang-Chi Hsieh <[email protected]> Closes apache#10538 from viirya/move-udt-test.
…object ## What changes were proposed in this pull request? This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column. This PR also removes the support to use `Product` as a valid external type of struct type. This support is added at #9712, but is incomplete, e.g. nested product, product in array are both not working. However, we never officially support this feature and I think it's ok to ban it. ## How was this patch tested? new tests in `RowEncoderSuite`. Author: Wenchen Fan <[email protected]> Closes #13401 from cloud-fan/bug. (cherry picked from commit 30c4774) Signed-off-by: Cheng Lian <[email protected]>
…object ## What changes were proposed in this pull request? This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column. This PR also removes the support to use `Product` as a valid external type of struct type. This support is added at #9712, but is incomplete, e.g. nested product, product in array are both not working. However, we never officially support this feature and I think it's ok to ban it. ## How was this patch tested? new tests in `RowEncoderSuite`. Author: Wenchen Fan <[email protected]> Closes #13401 from cloud-fan/bug.
JIRA: https://issues.apache.org/jira/browse/SPARK-11743
RowEncoder doesn't support UserDefinedType now. We should add the support for it.