[SPARK-6145][SQL] fix ORDER BY on nested fields #5189

marmbrus · 2015-03-25T22:06:21Z

This PR is based on work by @cloud-fan in #4904, but with two differences:

We isolate the logic for Sort's special handling into ResolveSortReferences
We avoid creating UnresolvedGetField expressions during resolution. Instead we either resolve GetField or we return None. This avoids us going down the wrong path early on.

SparkQA · 2015-03-25T23:29:32Z

Test build #29187 has finished for PR 5189 at commit e35b27c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]

cloud-fan · 2015-03-26T03:52:16Z

Since we decide to resolve Sort in ResolveSortReferences, then we need to stop resolve Sort in ResolveReferences. For example,

test("quick") {
    jsonRDD(sparkContext.makeRDD(
      """{"a": {"b": 1}, "b": 2}""" ::
      """{"a": {"b": 2}, "b": 1}""" :: Nil)).registerTempTable("t")
    checkAnswer(sql("SELECT a.b FROM t ORDER BY b"), Seq(Row(2), Row(1)))
  }

This test will fail, because we have mistakenly resolved b as a.b in ResolveReferences(we try to resolve b on top of Projection("a.b") and success). A quick fix is just make sure we don't resolve Sort in ResolveReferences like case u @ UnresolvedAttribute(name) if !q.isInstanceOf[Sort]. Do you have better idea @marmbrus ?

cloud-fan · 2015-03-26T04:55:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Even there is no missing in project, we still need to build new Sort with resolved SortOrder.

cloud-fan · 2015-03-26T05:15:01Z

Another problem, for something like SELECT DISTINCT a FROM t ORDER BY a, we can't resolve this Sort because its child is not Project or Aggregate, thus can't handled by ResolveSortReferences.

marmbrus · 2015-03-26T19:35:21Z

@cloud-fan, thanks for the further comments. Regarding not resolving sort in in ResolveReferences, I agree that is a little confusing and I think you are correct for standard SQL. However, this is not how hive does resolution and I don't think that we can differ from their implementation. Consider the following:

  test("hive orderby queries") {
    val df = jsonRDD(sparkContext.makeRDD(
      """{"a": {"b": 1}, "b": 2}""" ::
      """{"a": {"b": 2}, "b": 1}""" :: Nil))registerTempTable("t")
    sql("CREATE TABLE t1 AS SELECT * FROM t")
    runSqlHive("SELECT a.b FROM t1 ORDER BY b").foreach(println)
    sql("SELECT a.b FROM t1 ORDER BY b").collect().foreach(println)
  }

Which outputs:

1
2
[1]
[2]

So given this, I think the current way we resolve things (i.e. first resolve based on what comes from the select clause and then fall back on its child only when this fails) is non-standard but what we must do to avoid breaking existing HiveQL queries.

yhuai · 2015-03-26T20:24:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

Update comment? Seems we will not add UnresolvedGetField any more.

SparkQA · 2015-03-26T22:06:16Z

Test build #29255 has finished for PR 5189 at commit ca02d85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-03-27T03:13:14Z

Ah, so SELECT a.b is equal to SELECT a.b AS b, and the b in ORDER BY b should reference to a.b, that's how hive do resolution and makes sense. I have sent you a PR to remove the duplicated resolveGetField and others LGTM.

SparkQA · 2015-03-29T20:11:37Z

Test build #29367 has finished for PR 5189 at commit b8cae45.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Analyzer(
- trait CheckAnalysis
This patch does not change any dependencies.

yhuai · 2015-03-29T21:10:53Z

LGTM

This PR is based on work by cloud-fan in #4904, but with two differences: - We isolate the logic for Sort's special handling into `ResolveSortReferences` - We avoid creating UnresolvedGetField expressions during resolution. Instead we either resolve GetField or we return None. This avoids us going down the wrong path early on. Author: Michael Armbrust <[email protected]> Closes #5189 from marmbrus/nestedOrderBy and squashes the following commits: b8cae45 [Michael Armbrust] fix another test 0f36a11 [Michael Armbrust] WIP 91820cd [Michael Armbrust] Fix bug. (cherry picked from commit cd48ca5) Signed-off-by: Michael Armbrust <[email protected]> Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala

It's after #5189 Author: Wenchen Fan <[email protected]> Closes #5304 from cloud-fan/tmp and squashes the following commits: c58c9b3 [Wenchen Fan] remove duplicated code and update comment

This PR is a improvement for #5189. The resolution rule for ORDER BY is: first resolve based on what comes from the select clause and then fall back on its child only when this fails. There are 2 steps. First, try to resolve `Sort` in `ResolveReferences` based on select clause, and ignore exceptions. Second, try to resolve `Sort` in `ResolveSortReferences` and add missing projection. However, the way we resolve `SortOrder` is wrong. We just resolve `UnresolvedAttribute` and use the result to indicate if we can resolve `SortOrder`. But `UnresolvedAttribute` is only part of `GetField` chain(broken by `GetItem`), so we need to go through the whole chain to indicate if we can resolve `SortOrder`. With this change, we can also avoid re-throw GetField exception in `CheckAnalysis` which is little ugly. Author: Wenchen Fan <[email protected]> Closes #5659 from cloud-fan/order-by and squashes the following commits: cfa79f8 [Wenchen Fan] update test 3245d28 [Wenchen Fan] minor improve 465ee07 [Wenchen Fan] address comment 1fc41a2 [Wenchen Fan] fix SPARK-7067

This PR is a improvement for apache#5189. The resolution rule for ORDER BY is: first resolve based on what comes from the select clause and then fall back on its child only when this fails. There are 2 steps. First, try to resolve `Sort` in `ResolveReferences` based on select clause, and ignore exceptions. Second, try to resolve `Sort` in `ResolveSortReferences` and add missing projection. However, the way we resolve `SortOrder` is wrong. We just resolve `UnresolvedAttribute` and use the result to indicate if we can resolve `SortOrder`. But `UnresolvedAttribute` is only part of `GetField` chain(broken by `GetItem`), so we need to go through the whole chain to indicate if we can resolve `SortOrder`. With this change, we can also avoid re-throw GetField exception in `CheckAnalysis` which is little ugly. Author: Wenchen Fan <[email protected]> Closes apache#5659 from cloud-fan/order-by and squashes the following commits: cfa79f8 [Wenchen Fan] update test 3245d28 [Wenchen Fan] minor improve 465ee07 [Wenchen Fan] address comment 1fc41a2 [Wenchen Fan] fix SPARK-7067

marmbrus mentioned this pull request Mar 25, 2015

[SPARK-6145][SQL] fix ORDER BY on nested fields #4904

Closed

cloud-fan reviewed Mar 26, 2015
View reviewed changes

yhuai reviewed Mar 26, 2015
View reviewed changes

marmbrus added 3 commits March 27, 2015 00:31

Fix bug.

91820cd

WIP

0f36a11

fix another test

b8cae45

marmbrus force-pushed the nestedOrderBy branch from ca02d85 to b8cae45 Compare March 29, 2015 18:42

asfgit closed this in cd48ca5 Mar 31, 2015

cloud-fan mentioned this pull request Apr 1, 2015

[SQL][minor] remove duplicated resolveGetField and update comment #5304

Closed

cloud-fan mentioned this pull request Apr 23, 2015

[SPARK-7067][SQL] fix bug when use complex nested fields in ORDER BY #5659

Closed

marmbrus deleted the nestedOrderBy branch August 3, 2015 22:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-6145][SQL] fix ORDER BY on nested fields #5189

[SPARK-6145][SQL] fix ORDER BY on nested fields #5189

Uh oh!

marmbrus commented Mar 25, 2015

Uh oh!

SparkQA commented Mar 25, 2015

Uh oh!

cloud-fan commented Mar 26, 2015

Uh oh!

cloud-fan Mar 26, 2015

Uh oh!

cloud-fan commented Mar 26, 2015

Uh oh!

marmbrus commented Mar 26, 2015

Uh oh!

yhuai Mar 26, 2015

Uh oh!

SparkQA commented Mar 26, 2015

Uh oh!

cloud-fan commented Mar 27, 2015

Uh oh!

SparkQA commented Mar 29, 2015

Uh oh!

yhuai commented Mar 29, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[SPARK-6145][SQL] fix ORDER BY on nested fields #5189

[SPARK-6145][SQL] fix ORDER BY on nested fields #5189

Uh oh!

Conversation

marmbrus commented Mar 25, 2015

Uh oh!

SparkQA commented Mar 25, 2015

Uh oh!

cloud-fan commented Mar 26, 2015

Uh oh!

cloud-fan Mar 26, 2015

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Mar 26, 2015

Uh oh!

marmbrus commented Mar 26, 2015

Uh oh!

yhuai Mar 26, 2015

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Mar 26, 2015

Uh oh!

cloud-fan commented Mar 27, 2015

Uh oh!

SparkQA commented Mar 29, 2015

Uh oh!

yhuai commented Mar 29, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants