Skip to content

Conversation

@marmbrus
Copy link
Contributor

This PR is based on work by @cloud-fan in #4904, but with two differences:

  • We isolate the logic for Sort's special handling into ResolveSortReferences
  • We avoid creating UnresolvedGetField expressions during resolution. Instead we either resolve GetField or we return None. This avoids us going down the wrong path early on.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29187 has finished for PR 5189 at commit e35b27c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]

@cloud-fan
Copy link
Contributor

Since we decide to resolve Sort in ResolveSortReferences, then we need to stop resolve Sort in ResolveReferences. For example,

test("quick") {
    jsonRDD(sparkContext.makeRDD(
      """{"a": {"b": 1}, "b": 2}""" ::
      """{"a": {"b": 2}, "b": 1}""" :: Nil)).registerTempTable("t")
    checkAnswer(sql("SELECT a.b FROM t ORDER BY b"), Seq(Row(2), Row(1)))
  }

This test will fail, because we have mistakenly resolved b as a.b in ResolveReferences(we try to resolve b on top of Projection("a.b") and success). A quick fix is just make sure we don't resolve Sort in ResolveReferences like case u @ UnresolvedAttribute(name) if !q.isInstanceOf[Sort]. Do you have better idea @marmbrus ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even there is no missing in project, we still need to build new Sort with resolved SortOrder.

@cloud-fan
Copy link
Contributor

Another problem, for something like SELECT DISTINCT a FROM t ORDER BY a, we can't resolve this Sort because its child is not Project or Aggregate, thus can't handled by ResolveSortReferences.

@marmbrus
Copy link
Contributor Author

@cloud-fan, thanks for the further comments. Regarding not resolving sort in in ResolveReferences, I agree that is a little confusing and I think you are correct for standard SQL. However, this is not how hive does resolution and I don't think that we can differ from their implementation. Consider the following:

  test("hive orderby queries") {
    val df = jsonRDD(sparkContext.makeRDD(
      """{"a": {"b": 1}, "b": 2}""" ::
      """{"a": {"b": 2}, "b": 1}""" :: Nil))registerTempTable("t")
    sql("CREATE TABLE t1 AS SELECT * FROM t")
    runSqlHive("SELECT a.b FROM t1 ORDER BY b").foreach(println)
    sql("SELECT a.b FROM t1 ORDER BY b").collect().foreach(println)
  }

Which outputs:

1
2
[1]
[2]

So given this, I think the current way we resolve things (i.e. first resolve based on what comes from the select clause and then fall back on its child only when this fails) is non-standard but what we must do to avoid breaking existing HiveQL queries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment? Seems we will not add UnresolvedGetField any more.

@SparkQA
Copy link

SparkQA commented Mar 26, 2015

Test build #29255 has finished for PR 5189 at commit ca02d85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Ah, so SELECT a.b is equal to SELECT a.b AS b, and the b in ORDER BY b should reference to a.b, that's how hive do resolution and makes sense. I have sent you a PR to remove the duplicated resolveGetField and others LGTM.

@SparkQA
Copy link

SparkQA commented Mar 29, 2015

Test build #29367 has finished for PR 5189 at commit b8cae45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Analyzer(
    • trait CheckAnalysis
  • This patch does not change any dependencies.

@yhuai
Copy link
Contributor

yhuai commented Mar 29, 2015

LGTM

@asfgit asfgit closed this in cd48ca5 Mar 31, 2015
asfgit pushed a commit that referenced this pull request Mar 31, 2015
This PR is based on work by cloud-fan in #4904, but with two differences:
 - We isolate the logic for Sort's special handling into `ResolveSortReferences`
 - We avoid creating UnresolvedGetField expressions during resolution.  Instead we either resolve GetField or we return None.  This avoids us going down the wrong path early on.

Author: Michael Armbrust <[email protected]>

Closes #5189 from marmbrus/nestedOrderBy and squashes the following commits:

b8cae45 [Michael Armbrust] fix another test
0f36a11 [Michael Armbrust] WIP
91820cd [Michael Armbrust] Fix bug.

(cherry picked from commit cd48ca5)
Signed-off-by: Michael Armbrust <[email protected]>

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
asfgit pushed a commit that referenced this pull request Apr 8, 2015
It's after #5189

Author: Wenchen Fan <[email protected]>

Closes #5304 from cloud-fan/tmp and squashes the following commits:

c58c9b3 [Wenchen Fan] remove duplicated code and update comment
asfgit pushed a commit that referenced this pull request Jun 17, 2015
This PR is a improvement for #5189.

The resolution rule for ORDER BY is: first resolve based on what comes from the select clause and then fall back on its child only when this fails.

There are 2 steps. First, try to resolve `Sort` in `ResolveReferences` based on select clause, and ignore exceptions. Second, try to resolve `Sort` in `ResolveSortReferences` and add missing projection.

However, the way we resolve `SortOrder` is wrong. We just resolve `UnresolvedAttribute` and use the result to indicate if we can resolve `SortOrder`. But `UnresolvedAttribute` is only part of `GetField` chain(broken by `GetItem`), so we need to go through the whole chain to indicate if we can resolve `SortOrder`.

With this change, we can also avoid re-throw GetField exception in `CheckAnalysis` which is little ugly.

Author: Wenchen Fan <[email protected]>

Closes #5659 from cloud-fan/order-by and squashes the following commits:

cfa79f8 [Wenchen Fan] update test
3245d28 [Wenchen Fan] minor improve
465ee07 [Wenchen Fan] address comment
1fc41a2 [Wenchen Fan] fix SPARK-7067
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
This PR is a improvement for apache#5189.

The resolution rule for ORDER BY is: first resolve based on what comes from the select clause and then fall back on its child only when this fails.

There are 2 steps. First, try to resolve `Sort` in `ResolveReferences` based on select clause, and ignore exceptions. Second, try to resolve `Sort` in `ResolveSortReferences` and add missing projection.

However, the way we resolve `SortOrder` is wrong. We just resolve `UnresolvedAttribute` and use the result to indicate if we can resolve `SortOrder`. But `UnresolvedAttribute` is only part of `GetField` chain(broken by `GetItem`), so we need to go through the whole chain to indicate if we can resolve `SortOrder`.

With this change, we can also avoid re-throw GetField exception in `CheckAnalysis` which is little ugly.

Author: Wenchen Fan <[email protected]>

Closes apache#5659 from cloud-fan/order-by and squashes the following commits:

cfa79f8 [Wenchen Fan] update test
3245d28 [Wenchen Fan] minor improve
465ee07 [Wenchen Fan] address comment
1fc41a2 [Wenchen Fan] fix SPARK-7067
@marmbrus marmbrus deleted the nestedOrderBy branch August 3, 2015 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants