@@ -385,7 +385,7 @@ class SqlParser extends AbstractSparkSQLParser {

   protected lazy val dotExpressionHeader: Parser[Expression] =
     (ident <~ ".") ~ ident ~ rep("." ~> ident) ^^ {
-      case i1 ~ i2 ~ rest => UnresolvedAttribute(i1 + "." + i2 + rest.mkString(".", ".", ""))
+      case i1 ~ i2 ~ rest => UnresolvedAttribute((Seq(i1, i2) ++ rest).mkString("."))
     }

   protected lazy val dataType: Parser[DataType] =
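
An aside, not part of the patch: the two formulations agree whenever rest is non-empty, but Scala's mkString(start, sep, end) still emits start + end for an empty sequence, so the old concatenation picks up a stray trailing dot on a plain two-part name like a.b. A standalone check (my own illustration, not PR code):

object DottedNameCheck extends App {
  def oldForm(i1: String, i2: String, rest: Seq[String]): String =
    i1 + "." + i2 + rest.mkString(".", ".", "")

  def newForm(i1: String, i2: String, rest: Seq[String]): String =
    (Seq(i1, i2) ++ rest).mkString(".")

  // Three or more identifiers: both agree.
  assert(oldForm("a", "b", Seq("c")) == "a.b.c")
  assert(newForm("a", "b", Seq("c")) == "a.b.c")

  // Exactly two identifiers: the empty rest still contributes start + end = "."
  // in the old form, leaving a trailing dot.
  assert(oldForm("a", "b", Nil) == "a.b.")
  assert(newForm("a", "b", Nil) == "a.b")
}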
@@ -310,7 +310,7 @@ class Analyzer(catalog: Catalog,
   }

   /**
-   * In many dialects of SQL is it valid to sort by attributes that are not present in the SELECT
+   * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT
    * clause. This rule detects such queries and adds the required attributes to the original
    * projection, so that they will be available during sorting. Another projection is added to
    * remove these attributes after sorting.
@@ -321,7 +321,8 @@ class Analyzer(catalog: Catalog,
       if !s.resolved && p.resolved =>
         val unresolved = ordering.flatMap(_.collect { case UnresolvedAttribute(name) => name })
         val resolved = unresolved.flatMap(child.resolve(_, resolver))
-        val requiredAttributes = AttributeSet(resolved.collect { case a: Attribute => a })
+        val requiredAttributes =
+          AttributeSet(resolved.flatMap(_.collect { case a: Attribute => a }))

         val missingInProject = requiredAttributes -- p.output
         if (missingInProject.nonEmpty) {
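
The extra flatMap matters because, as the stack trace later in this thread shows, a nested reference such as a.b resolves to a GetField expression wrapping an Attribute rather than to a bare Attribute, so collecting over only the top level of the sequence misses it. A minimal sketch with toy stand-ins for Catalyst's classes (these simplified types are mine, not Spark's):

// Toy stand-ins for Catalyst's expression tree (illustration only).
sealed trait Expr {
  def children: Seq[Expr]
  // Pre-order traversal keeping every node the partial function accepts,
  // in the spirit of Catalyst's TreeNode.collect.
  def collect[B](pf: PartialFunction[Expr, B]): Seq[B] =
    pf.lift(this).toSeq ++ children.flatMap(_.collect(pf))
}
case class Attribute(name: String) extends Expr { def children = Nil }
case class GetField(child: Expr, field: String) extends Expr { def children = Seq(child) }

// Pretend ORDER BY referenced a top-level column c and a nested field a.b.
val resolved: Seq[Expr] = Seq(Attribute("c"), GetField(Attribute("a"), "b"))

// Seq#collect inspects only the top-level elements, so the GetField is
// skipped and the attribute underneath it is lost.
resolved.collect { case a: Attribute => a }            // Seq(Attribute(c))

// flatMap(_.collect(...)) descends into each tree and also finds a.
resolved.flatMap(_.collect { case a: Attribute => a }) // Seq(Attribute(c), Attribute(a))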
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (10 additions, 0 deletions)

@@ -1049,4 +1049,14 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
     rdd.toDF().registerTempTable("distinctData")
     checkAnswer(sql("SELECT COUNT(DISTINCT key,value) FROM distinctData"), Row(2))
   }
+
+  test("SPARK-6145: ORDER BY test for nested fields") {
+    jsonRDD(sparkContext.makeRDD(
+      """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil)).registerTempTable("nestedOrder")
+    // These should be successfully analyzed
+    sql("SELECT 1 FROM nestedOrder ORDER BY a.b").queryExecution.analyzed
+    sql("SELECT a.b FROM nestedOrder ORDER BY a.b").queryExecution.analyzed
+    sql("SELECT 1 FROM nestedOrder ORDER BY a.a.a").queryExecution.analyzed
+    sql("SELECT 1 FROM nestedOrder ORDER BY c[0].d").queryExecution.analyzed
+  }
 }
Contributor Author:
I'd rather use checkAnswer for these tests, especially if we are going to put them in SQLQuerySuite. When checkAnswer fails it gives nice exceptions, and then we test end to end.

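For example (my own sketch, not from the thread), the first case could become an end-to-end check against the same single-row nestedOrder table:

// Hypothetical checkAnswer form of the first case: a wrong result would now
// fail the test, rather than only an analysis error.
checkAnswer(
  sql("SELECT 1 FROM nestedOrder ORDER BY a.b"),
  Row(1))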
Contributor:
I've tried the latest master code in the Spark SQL CLI:

create table struct1 as select named_struct("a",key, "b", value) as a from src limit 1;
select 1 from struct1 order by a.b; -- OK
select a.a from struct1 order by a.b; -- failed
select a.a from struct1 order by a.a; -- failed
org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type StringType;
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.resolveGetField(Analyzer.scala:307)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:271)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:260)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

Contributor:
BTW, in Hive:

hive> select a.a from struct1 order by a;   -- Works
hive> select a.b from struct1 order by b;   -- Works
hive> select a.a from struct1 order by a.a;
FAILED: SemanticException [Error 10042]: Line 1:33 . Operator is only supported on struct or list of struct types 'a'
hive> select 1 from struct1 order by a.a;
FAILED: SemanticException [Error 10004]: Line 1:31 Invalid table alias or column reference 'a': (possible column names are: _c0)
hive> select _c0 from struct1 order by a.a;
FAILED: ParseException line 1:11 missing \' at 'from' near '<EOF>'

It seems Hive has bugs with these ambiguous attribute references; that's why I think we probably need to change that code:
https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201

Contributor:
Or it seems Hive only supports ORDER BY references to attributes that are listed in the projection list.

Contributor Author:
Oh right, analyzed is not actually checking analysis. Ugh... my mistake.

I think the bug here is that we are partially analyzing nested field accesses. We should not resolve the a in a.a unless we can also resolve the field access.

The fact that Hive only supports ordering on things from the SELECT clause sounds like a bug to me. That is not how the SQL spec works, right?

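A minimal sketch of that all-or-nothing idea, with a toy schema far simpler than Catalyst's (the names and types here are mine, not Spark's): resolution of a dotted path succeeds only when every segment resolves, so the leading attribute is never resolved on its own.

// Toy schema: a field maps either to a nested Struct or to a leaf type name.
case class Struct(fields: Map[String, Any])

// Resolve a dotted path such as Seq("a", "a") against a schema.
// Returns None unless the *entire* chain of field accesses is valid,
// instead of resolving the head and failing later on the field access.
def resolvePath(schema: Map[String, Any], path: Seq[String]): Option[Any] =
  path match {
    case Seq()        => None
    case Seq(last)    => schema.get(last)
    case head +: rest =>
      schema.get(head) match {
        case Some(Struct(inner)) => resolvePath(inner, rest)
        case _                   => None // head resolving alone is not enough
      }
  }

val schema = Map("a" -> Struct(Map("a" -> "IntType", "b" -> "StringType")))
resolvePath(schema, Seq("a", "a")) // Some(IntType)
resolvePath(schema, Seq("a", "c")) // None: the field access fails, so nothing resolves
resolvePath(schema, Seq("b", "a")) // None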
Contributor:
Supporting ordering by more than the attributes in the SELECT clause should be a feature of Spark SQL, so it seems we may not be able to keep the same naming convention as Hive for nested data access. But will that break lots of stuff?

Contributor Author:
If you end up making things ambiguous, why not just alias the unnesting manually? I do not think it is okay to change the default unnesting alias anymore.

Contributor:
Yes, that's exactly what I described in https://github.com/apache/spark/pull/4892/files#diff-27c76f96a7b2733ecfd6f46a1716e153R201.
But the bug you raised in #4892 is quite interesting:

sqlContext.jsonRDD(sc.parallelize("""{"a": {"a": {"a": 1}}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")
sqlContext.sql("SELECT a.a.a FROM nestedOrder ORDER BY a.a.a")

Anyway, I will do some investigation on database systems other than Hive.
