-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29503][SQL] Remove conversion CreateNamedStruct to CreateNamedStructUnsafe #26173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,8 +18,12 @@ | |
| package org.apache.spark.sql | ||
|
|
||
| import org.apache.spark.sql.catalyst.DefinedByConstructorParams | ||
| import org.apache.spark.sql.catalyst.expressions.Expression | ||
| import org.apache.spark.sql.catalyst.expressions.objects.MapObjects | ||
| import org.apache.spark.sql.functions._ | ||
| import org.apache.spark.sql.internal.SQLConf | ||
| import org.apache.spark.sql.test.SharedSparkSession | ||
| import org.apache.spark.sql.types.ArrayType | ||
|
|
||
| /** | ||
| * A test suite to test DataFrame/SQL functionalities with complex types (i.e. array, struct, map). | ||
|
|
@@ -64,6 +68,24 @@ class DataFrameComplexTypeSuite extends QueryTest with SharedSparkSession { | |
| val ds100_5 = Seq(S100_5()).toDS() | ||
| ds100_5.rdd.count | ||
| } | ||
|
|
||
| test("SPARK-29503 nest unsafe struct inside safe array") { | ||
| withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false") { | ||
| val df = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items") | ||
|
|
||
| // items: Seq[Int] => items.map { item => Seq(Struct(item)) } | ||
| val result = df.select( | ||
| new Column(MapObjects( | ||
| (item: Expression) => array(struct(new Column(item))).expr, | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm, while fix seems fine to me too, was this only the reproducer?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I haven't spent another time to try it (as it seems to be clean and simple reproducer). I'm not sure it's not going to be valid reproducer just due to pulling catalyst package. Catalyst could analyze the query and inject it if necessary in any way. I indicated you'd like to revisit #25745 - that was WIP and it didn't have any number of performance gain. I'd rather choose "safeness" over "speed", and even we haven't figured out there's outstanding difference between twos. It was the only one case MapObjects could have unsafe struct, by allowing this, safe and unsafe are possibly mixed up leading to encounter corner case.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I am not against this change. In that way, I think this fix is fine but wanted to know if this actually affects any user-facing surface. Was also wondering if we can benefit from #25745 since some investigations look already made there to completely use unsafe one instead.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My guess of #25745 is, it was based on the assumption that it's safe to replace CreateNamedStruct with CreateNamedStructUnsafe as we already have one path to do this - and this observation broke the assumption. IMHO, once we found it's not safe to do this, the improvement has to prove safety before we take its benefits into account. |
||
| $"items".expr, | ||
| df.schema("items").dataType.asInstanceOf[ArrayType].elementType | ||
| )) as "items" | ||
| ).collect() | ||
|
|
||
| assert(result.size === 1) | ||
| assert(result === Row(Seq(Seq(Row(1)), Seq(Row(2)), Seq(Row(3)))) :: Nil) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| class S100( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue is encountered only when whole stage codegen is disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least yes for provided reproducer. For other case I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. For whole stage codegen, CreateNamedStruct is not converted to CreateNamedStructUnsafe, so the nested struct is not unsafe one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK got it. Thanks for explanation.