Skip to content

Conversation

@marmbrus
Copy link
Contributor

@marmbrus marmbrus commented Mar 5, 2015

Based on #4904 with style errors fixed.

LogicalPlan#resolve will not only produce Attribute, but also "GetField chain".
So in ResolveSortReferences, after resolve the ordering expressions, we should not just collect the Attribute results, but also Attribute at the bottom of "GetField chain".

@SparkQA
Copy link

SparkQA commented Mar 5, 2015

Test build #28307 has started for PR 4918 at commit 997f84e.

  • This patch merges cleanly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use check answer for these tests, especially if we are going to put them in SQLQuerySuite. When check answer fails it'll give nice exceptions and then we test end to end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried the latest master code in Spark SQL CLI:

create table struct1 as select named_struct("a",key, "b", value) as a from src limit 1;
select 1 from struct1 order by a.b; -- OK
select a.a from struct1 order by a.b; -- failed
select a.a from struct1 order by a.a; -- failed
org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type StringType;
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.resolveGetField(Analyzer.scala:307)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:271)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, In Hive

hive>select a.a from struct1 order by a; -- Works
hive>select a.b from struct1 order by b; -- Works
hive> select a.a from struct1 order by a.a;
FAILED: SemanticException [Error 10042]: Line 1:33 . Operator is only supported on struct or list of struct types 'a'
hive> select 1 from struct1 order by a.a;  
FAILED: SemanticException [Error 10004]: Line 1:31 Invalid table alias or column reference 'a': (possible column names are: _c0)
hive> select _c0 from struct1 order by a.a;
FAILED: ParseException line 1:11 missing \' at 'from' near '<EOF>'

Seems Hive has bugs on this ambiguous attribute references, that's why I think we probably need to change that code:
https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or seems Hive only support the ORDER BY referenced attributes to be listed in the projection list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, analyzed is not actually checking analysis. Ugh... My mistake.

I think the bug here is that we are partially analyzing nested field accesses. We should not resolve the a in a.a unless we can also resolve the field access too.

The fact that Hive only supports ordering on things from the SELECT clause sounds like a bug to me. That is not how the SQL spec works right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supports ordering more than attributes from the SELECT clause should be the feature of Spark SQL, so seems we may not able to keep the same name convention as Hive does for the nested data accessing, but will that break lots of stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you end up making things ambiguous, why not just alias the unnesting manually? I do not think it is okay to change the default unnesting alias anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly what I described https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201
But the bug you raised in #4892 is quite interesting

sqlContext.jsonRDD(sc.parallelize("""{"a": {"a": {"a": 1}}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")
sqlContext.sql("SELECT a.a.a FROM nestedOrder ORDER BY a.a.a")

Anyway, I will do some investigation on some other database systems other than Hive.

@SparkQA
Copy link

SparkQA commented Mar 5, 2015

Test build #28307 has finished for PR 4918 at commit 997f84e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28307/
Test PASSed.

asfgit pushed a commit that referenced this pull request Mar 5, 2015
Based on #4904 with style errors fixed.

`LogicalPlan#resolve` will not only produce `Attribute`, but also "`GetField` chain".
So in `ResolveSortReferences`, after resolve the ordering expressions, we should not just collect the `Attribute` results, but also `Attribute` at the bottom of "`GetField` chain".

Author: Wenchen Fan <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #4918 from marmbrus/pr/4904 and squashes the following commits:

997f84e [Michael Armbrust] fix style
3eedbfc [Wenchen Fan] fix 6145

(cherry picked from commit 5873c71)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 5873c71 Mar 5, 2015
@cloud-fan
Copy link
Contributor

Hi @marmbrus , I studied how we handle ORDER BY and had a more complete fix.
For simple example "SELECT a, b FROM t ORDER BY b", it will be parsed into

Sort(
  Seq(SortOrder(Unresolved("b"))),
  Projection(Seq(Unresolved("a"), Unresolved("b")), Relation)
)

and there are 2 parts need to be resolved: "a" and "b" in Projection, "a" in SortOrder.
Remember, we resolve those unresolved stuff in ResolveReferences rule by code

case u @ UnresolvedAttribute(name) =>
  // Leave unchanged if resolution fails.  Hopefully will be resolved next round.
  val result = q.resolveChildren(name, resolver).getOrElse(u)
  logDebug(s"Resolving $u to $result")
  result

So first we can resolve Projection successfully, and when we resolve Sort, its input is Projection's output. And the ORDER BY referenced attribute "a" exists in the projection list, so we can also resolve Sort successfully.
But for SELECT a FROM t ORDER BY b, the ORDER BY referenced attribute "a" doesn't exist in the projection list, and what we do is expanding the projection list by ResolveSortReferences rule. I think this rule's name is confusing, it doesn't resolve Sort(actually SortOrders in Sort), but only expanding projection list or grouping expressions, we still resolve Sort in ResolveReferences rule(we can resolve it successfully after expanding projection list).
This works well until we have nested fields. For example "SELECT a.b FROM t ORDER BY b.a", what we should do is adding "b" to projection list, but actually we will get error when try to resolve Sort in ResolveReferences rule. "a.b" will have alias name "b", and we will mistakenly think we have Attribute "b" and try to resolve "b.a" on top of it.
One possible fix is changing alias name of GetField chain like @chenghao-intel did in #4892, but it need more work to handle GetItem stuff. I think a better idea is override resolveChildren method in Sort and filter out the GetField stuff from Projection's output. I have done it in #4904.
So this PR is not complete, it do fixes some issues but not all of them, please review #4904 and discuss with me if you have any questions.

@marmbrus marmbrus deleted the pr/4904 branch March 9, 2015 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants