[SPARK-6145][SQL] fix ORDER BY on nested fields #4918

marmbrus · 2015-03-05T20:21:54Z

Based on #4904 with style errors fixed.

LogicalPlan#resolve will not only produce Attribute, but also "GetField chain".
So in ResolveSortReferences, after resolve the ordering expressions, we should not just collect the Attribute results, but also Attribute at the bottom of "GetField chain".

SparkQA · 2015-03-05T20:22:43Z

Test build #28307 has started for PR 4918 at commit 997f84e.

This patch merges cleanly.

marmbrus · 2015-03-05T20:31:22Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

I'd rather use check answer for these tests, especially if we are going to put them in SQLQuerySuite. When check answer fails it'll give nice exceptions and then we test end to end.

I've tried the latest master code in Spark SQL CLI:

create table struct1 as select named_struct("a",key, "b", value) as a from src limit 1; select 1 from struct1 order by a.b; -- OK select a.a from struct1 order by a.b; -- failed select a.a from struct1 order by a.a; -- failed org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type StringType; at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.resolveGetField(Analyzer.scala:307) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:271) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

BTW, In Hive

hive>select a.a from struct1 order by a; -- Works hive>select a.b from struct1 order by b; -- Works hive> select a.a from struct1 order by a.a; FAILED: SemanticException [Error 10042]: Line 1:33 . Operator is only supported on struct or list of struct types 'a' hive> select 1 from struct1 order by a.a; FAILED: SemanticException [Error 10004]: Line 1:31 Invalid table alias or column reference 'a': (possible column names are: _c0) hive> select _c0 from struct1 order by a.a; FAILED: ParseException line 1:11 missing \' at 'from' near '<EOF>'

Seems Hive has bugs on this ambiguous attribute references, that's why I think we probably need to change that code:
https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201

Or seems Hive only support the ORDER BY referenced attributes to be listed in the projection list.

Oh right, analyzed is not actually checking analysis. Ugh... My mistake.

I think the bug here is that we are partially analyzing nested field accesses. We should not resolve the a in a.a unless we can also resolve the field access too.

The fact that Hive only supports ordering on things from the SELECT clause sounds like a bug to me. That is not how the SQL spec works right?

Supports ordering more than attributes from the SELECT clause should be the feature of Spark SQL, so seems we may not able to keep the same name convention as Hive does for the nested data accessing, but will that break lots of stuff?

If you end up making things ambiguous, why not just alias the unnesting manually? I do not think it is okay to change the default unnesting alias anymore.

Yes, that's exactly what I described https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201
But the bug you raised in #4892 is quite interesting

sqlContext.jsonRDD(sc.parallelize("""{"a": {"a": {"a": 1}}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder") sqlContext.sql("SELECT a.a.a FROM nestedOrder ORDER BY a.a.a")

Anyway, I will do some investigation on some other database systems other than Hive.

SparkQA · 2015-03-05T21:29:37Z

Test build #28307 has finished for PR 4918 at commit 997f84e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-05T21:29:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28307/
Test PASSed.

Based on #4904 with style errors fixed. `LogicalPlan#resolve` will not only produce `Attribute`, but also "`GetField` chain". So in `ResolveSortReferences`, after resolve the ordering expressions, we should not just collect the `Attribute` results, but also `Attribute` at the bottom of "`GetField` chain". Author: Wenchen Fan <[email protected]> Author: Michael Armbrust <[email protected]> Closes #4918 from marmbrus/pr/4904 and squashes the following commits: 997f84e [Michael Armbrust] fix style 3eedbfc [Wenchen Fan] fix 6145 (cherry picked from commit 5873c71) Signed-off-by: Michael Armbrust <[email protected]>

cloud-fan · 2015-03-06T14:19:38Z

Hi @marmbrus , I studied how we handle ORDER BY and had a more complete fix.
For simple example "SELECT a, b FROM t ORDER BY b", it will be parsed into

Sort(
  Seq(SortOrder(Unresolved("b"))),
  Projection(Seq(Unresolved("a"), Unresolved("b")), Relation)
)

and there are 2 parts need to be resolved: "a" and "b" in Projection, "a" in SortOrder.
Remember, we resolve those unresolved stuff in ResolveReferences rule by code

case u @ UnresolvedAttribute(name) =>
  // Leave unchanged if resolution fails.  Hopefully will be resolved next round.
  val result = q.resolveChildren(name, resolver).getOrElse(u)
  logDebug(s"Resolving $u to $result")
  result

So first we can resolve Projection successfully, and when we resolve Sort, its input is Projection's output. And the ORDER BY referenced attribute "a" exists in the projection list, so we can also resolve Sort successfully.
But for SELECT a FROM t ORDER BY b, the ORDER BY referenced attribute "a" doesn't exist in the projection list, and what we do is expanding the projection list by ResolveSortReferences rule. I think this rule's name is confusing, it doesn't resolve Sort(actually SortOrders in Sort), but only expanding projection list or grouping expressions, we still resolve Sort in ResolveReferences rule(we can resolve it successfully after expanding projection list).
This works well until we have nested fields. For example "SELECT a.b FROM t ORDER BY b.a", what we should do is adding "b" to projection list, but actually we will get error when try to resolve Sort in ResolveReferences rule. "a.b" will have alias name "b", and we will mistakenly think we have Attribute "b" and try to resolve "b.a" on top of it.
One possible fix is changing alias name of GetField chain like @chenghao-intel did in #4892, but it need more work to handle GetItem stuff. I think a better idea is override resolveChildren method in Sort and filter out the GetField stuff from Projection's output. I have done it in #4904.
So this PR is not complete, it do fixes some issues but not all of them, please review #4904 and discuss with me if you have any questions.

cloud-fan and others added 2 commits March 5, 2015 15:52

fix 6145

3eedbfc

fix style

997f84e

marmbrus mentioned this pull request Mar 5, 2015

[SPARK-6145][SQL] fix ORDER BY on nested fields #4904

Closed

marmbrus reviewed Mar 5, 2015
View reviewed changes

asfgit closed this in 5873c71 Mar 5, 2015

marmbrus deleted the pr/4904 branch March 9, 2015 20:16

[SPARK-6145][SQL] fix ORDER BY on nested fields #4918

[SPARK-6145][SQL] fix ORDER BY on nested fields #4918

Uh oh!

Conversation

marmbrus commented Mar 5, 2015

Uh oh!

SparkQA commented Mar 5, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Mar 5, 2015

Uh oh!

AmplabJenkins commented Mar 5, 2015

Uh oh!

cloud-fan commented Mar 6, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants