[SPARK-6145][SQL] fix ORDER BY on nested fields #4918
Test build #28307 has started for PR 4918 at commit
I'd rather use `checkAnswer` for these tests, especially if we are going to put them in `SQLQuerySuite`. When `checkAnswer` fails it gives nice exceptions, and it also tests end to end.
I've tried the latest master code in Spark SQL CLI:
create table struct1 as select named_struct("a",key, "b", value) as a from src limit 1;
select 1 from struct1 order by a.b; -- OK
select a.a from struct1 order by a.b; -- failed
select a.a from struct1 order by a.a; -- failed
org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type StringType;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.resolveGetField(Analyzer.scala:307)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:271)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
BTW, in Hive:
hive>select a.a from struct1 order by a; -- Works
hive>select a.b from struct1 order by b; -- Works
hive> select a.a from struct1 order by a.a;
FAILED: SemanticException [Error 10042]: Line 1:33 . Operator is only supported on struct or list of struct types 'a'
hive> select 1 from struct1 order by a.a;
FAILED: SemanticException [Error 10004]: Line 1:31 Invalid table alias or column reference 'a': (possible column names are: _c0)
hive> select _c0 from struct1 order by a.a;
FAILED: ParseException line 1:11 missing \' at 'from' near '<EOF>'
It seems Hive has bugs with these ambiguous attribute references, which is why I think we probably need to change this code:
https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201
Or it seems Hive only supports ORDER BY on attributes that are listed in the projection list.
Oh right, analyzed is not actually checking analysis. Ugh... My mistake.
I think the bug here is that we are partially analyzing nested field accesses. We should not resolve the a in a.a unless we can also resolve the field access too.
The fact that Hive only supports ordering on things from the SELECT clause sounds like a bug to me. That is not how the SQL spec works, right?
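To illustrate the "do not resolve the `a` in `a.a` unless the field access resolves too" idea, here is a minimal, self-contained sketch. It is a toy model, not Catalyst code: `DataType`, `StructType`, `IntType`, `StringType`, and `resolve` below are hypothetical stand-ins defined only for this example, and the column types are assumed for illustration.

```scala
// Toy model for illustration only; these are NOT Spark's Catalyst classes.
object NestedResolveSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object IntType extends DataType
  final case class StructType(fields: Map[String, DataType]) extends DataType

  // Resolve a dotted name such as "a.b" against a set of top-level columns.
  // The leaf type is returned only if every part of the name resolves;
  // otherwise the result is None, so the caller never sees a
  // half-resolved reference.
  def resolve(columns: Map[String, DataType], name: String): Option[DataType] =
    name.split('.').toList match {
      case head :: rest =>
        rest.foldLeft(columns.get(head)) {
          case (Some(StructType(fields)), field) => fields.get(field)
          case _                                 => None
        }
      case Nil => None
    }
}
```

With a projection that outputs a non-struct column named `a`, `resolve(projected, "a.b")` yields `None` rather than a partially resolved `a`, so a caller can fall back to the underlying table, for example `resolve(projected, "a.b").orElse(resolve(table, "a.b"))`.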
Supporting ordering on more than the attributes from the SELECT clause should be a feature of Spark SQL, so it seems we may not be able to keep the same naming convention as Hive for nested data access. But will that break lots of stuff?
If you end up making things ambiguous, why not just alias the unnesting manually? I do not think it is okay to change the default unnesting alias anymore.
Yes, that's exactly what I described in https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201
But the bug you raised in #4892 is quite interesting:
sqlContext.jsonRDD(sc.parallelize("""{"a": {"a": {"a": 1}}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")
sqlContext.sql("SELECT a.a.a FROM nestedOrder ORDER BY a.a.a")
Anyway, I will do some investigation on database systems other than Hive.
Test build #28307 has finished for PR 4918 at commit
Test PASSed.
Based on #4904 with style errors fixed.

`LogicalPlan#resolve` will not only produce `Attribute`, but also "`GetField` chain". So in `ResolveSortReferences`, after resolve the ordering expressions, we should not just collect the `Attribute` results, but also `Attribute` at the bottom of "`GetField` chain".

Author: Wenchen Fan <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #4918 from marmbrus/pr/4904 and squashes the following commits:

997f84e [Michael Armbrust] fix style
3eedbfc [Wenchen Fan] fix 6145

(cherry picked from commit 5873c71)
Signed-off-by: Michael Armbrust <[email protected]>
Hi @marmbrus, I studied how we handle ORDER BY and have a more complete fix. There are 2 parts that need to be resolved: "a" and "b" in So first we can resolve
Based on #4904 with style errors fixed.
`LogicalPlan#resolve` will not only produce `Attribute`, but also "`GetField` chain".
So in `ResolveSortReferences`, after resolve the ordering expressions, we should not just collect the `Attribute` results, but also `Attribute` at the bottom of "`GetField` chain".
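The idea in the description above can be sketched with a small, self-contained toy model. This is not the actual patch and does not use Catalyst's classes: `Expr`, `Attribute`, `GetField`, `Literal`, `baseAttribute`, and `missingForSort` are hypothetical names defined here only to show how the `Attribute` at the bottom of a `GetField` chain might be collected.

```scala
import scala.annotation.tailrec

// Toy model for illustration only; these are NOT Spark's Catalyst classes.
object GetFieldChainSketch {
  sealed trait Expr
  final case class Attribute(name: String) extends Expr
  final case class GetField(child: Expr, field: String) extends Expr
  final case class Literal(value: Any) extends Expr

  // Walk down a GetField chain and return the Attribute at its bottom, if any.
  @tailrec
  def baseAttribute(e: Expr): Option[Attribute] = e match {
    case a: Attribute       => Some(a)
    case GetField(child, _) => baseAttribute(child)
    case _                  => None
  }

  // Attributes that the ordering needs but the projection does not output.
  // Plain attributes and attributes underneath GetField chains are both kept.
  def missingForSort(ordering: Seq[Expr], projected: Set[Attribute]): Seq[Attribute] =
    ordering
      .flatMap(e => baseAttribute(e).toList)
      .distinct
      .filterNot(projected.contains)
}
```

In this sketch, an ordering such as `GetField(Attribute("a"), "b")` contributes `Attribute("a")`, which a rule in the spirit of `ResolveSortReferences` could then add to the projection below the sort when it is missing.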