Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Nov 14, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-11743

RowEncoder doesn't support UserDefinedType now. We should add the support for it.

@SparkQA
Copy link

SparkQA commented Nov 14, 2015

Test build #45923 has finished for PR 9712 at commit 2721741.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 14, 2015

Test build #45926 has finished for PR 9712 at commit bab5c5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question here. According to the Javadoc of Row, a user should use and only use Row for StructType field. It's ok to support Product too, but do we have a reason for this? Is it needed for the UDT stuff? Sorry I'm not familiar with UDT handling, it will be good if you can explain it in detail, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I found this problem when working on ScalaUDF. ScalaUDF will use schemaFor to obtain catalyst type for UDF input and output. The catalyst type returned by schemaFor for a Product is StructType. It is reasonable as we don't have other type to represent Product as I see.

So for a StructType field in an external Row, both Row and Product are possible values. When we call extractorsFor on the external Row, externalDataTypeFor will return ObjectType(classOf[Row]) for this field. But the get accessor on the inputObject (i.e., the Row) will possibly return a Product for the ScalaUDF case and an exception will be thrown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one of the input parameter is Tuple2, then we need to use the encoder to decode a catalyst value to external value, i.e. decode an InternalRow object to Tuple2 object. I think this is hard for a RowEncoder(your change only makes it possible to encode a Product into InternalRow, but not vice versa), we should use ProductEncoder for this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have an input parameter mapping to a StructType field in an InternalRow, we will use Row as its input type. E.g., sqlContext.udf.register("udfFunc", (ns: Row) => { (ns.getInt(0), ns.getString(1)) }). But we can't use Row as output type for an UDF. Because we can still get the input schema of ScalaUDF's children expressions later if we can't infer input types correctly by using schemaFor. However, the output types of the UDF can be only inferred by schemaFor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. We need the type tag to infer the return type of UDF, and if a Row is returned, there is no type information we can get. How about we use ProductEncoder or FlatEncoder for the return value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, as I tried, I found a problem is we may not always be able to get the type T needed to construct ProductEncoder and FlatEncoder. Even we can get it, we can't keep it in ScalaUDF due to serialization issue. So I think using RowEncoder is more reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason we still support Product for StructType is for backward-compatibility, we did not enforce the inbound type before, someone may reply one it (because it's easier than Row in Scala).

@SparkQA
Copy link

SparkQA commented Nov 14, 2015

Test build #45931 has finished for PR 9712 at commit db644fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we merge these branches together?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. updated.

@SparkQA
Copy link

SparkQA commented Nov 16, 2015

Test build #45992 has finished for PR 9712 at commit a5fdbce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class JSONOptions(\n

@viirya
Copy link
Member Author

viirya commented Nov 16, 2015

retest this please.

@SparkQA
Copy link

SparkQA commented Nov 16, 2015

Test build #45994 has finished for PR 9712 at commit a5fdbce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class JSONOptions(\n

@davies
Copy link
Contributor

davies commented Nov 16, 2015

LGTM, merging this into master and 1.6 branch, thanks!

asfgit pushed a commit that referenced this pull request Nov 16, 2015
JIRA: https://issues.apache.org/jira/browse/SPARK-11743

RowEncoder doesn't support UserDefinedType now. We should add the support for it.

Author: Liang-Chi Hsieh <[email protected]>

Closes #9712 from viirya/rowencoder-udt.

(cherry picked from commit b0c3fd3)
Signed-off-by: Davies Liu <[email protected]>
@asfgit asfgit closed this in b0c3fd3 Nov 16, 2015
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason why we put this test here instead of adding arrayOfUDT type in encodeDecodeTest like structOfUDT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I moved it in #10538.

asfgit pushed a commit that referenced this pull request Jan 1, 2016
A following pr for #9712. Move the test for arrayOfUDT.

Author: Liang-Chi Hsieh <[email protected]>

Closes #10538 from viirya/move-udt-test.
marmbrus pushed a commit to marmbrus/spark that referenced this pull request Jan 7, 2016
A following pr for apache#9712. Move the test for arrayOfUDT.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#10538 from viirya/move-udt-test.
asfgit pushed a commit that referenced this pull request Jun 5, 2016
…object

## What changes were proposed in this pull request?

This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column.

This PR also removes the support to use `Product` as a valid external type of struct type.  This support is added at #9712, but is incomplete, e.g. nested product, product in array are both not working.  However, we never officially support this feature and I think it's ok to ban it.

## How was this patch tested?

new tests in `RowEncoderSuite`.

Author: Wenchen Fan <[email protected]>

Closes #13401 from cloud-fan/bug.

(cherry picked from commit 30c4774)
Signed-off-by: Cheng Lian <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 5, 2016
…object

## What changes were proposed in this pull request?

This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of input object. e.g. we should throw an exception when a field is boolean but is declared as a string column.

This PR also removes the support to use `Product` as a valid external type of struct type.  This support is added at #9712, but is incomplete, e.g. nested product, product in array are both not working.  However, we never officially support this feature and I think it's ok to ban it.

## How was this patch tested?

new tests in `RowEncoderSuite`.

Author: Wenchen Fan <[email protected]>

Closes #13401 from cloud-fan/bug.
@viirya viirya deleted the rowencoder-udt branch December 27, 2023 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants