
Conversation

@viirya
Member

@viirya viirya commented May 5, 2017

What changes were proposed in this pull request?

We have a rule in Analyzer that adds attributes referenced in a Filter but missing from its child plan into that child plan. It makes the following code work:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
df.select("y").where("x=1")

It should throw an analysis exception instead of implicitly adding the missing attributes into the underlying plan.
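
A minimal sketch of the intended behavior after this change (the ScalaTest-style intercept is just for illustration, not the actual test added in this PR):

import org.apache.spark.sql.AnalysisException

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
// Filtering on a column that was already projected away should now fail
// analysis instead of being resolved by silently re-adding "x" to the child.
intercept[AnalysisException] {
  df.select("y").where("x=1")
}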

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@cloud-fan
Contributor

Is it a bug? IIRC this is intentional, so that the DataFrame behavior is consistent with SQL.

@SparkQA

SparkQA commented May 5, 2017

Test build #76495 has finished for PR 17874 at commit 010d046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 6, 2017

The rule was added by #12235. From the description and code comment, it should only be used for a HAVING clause that accesses a grouping column not present in the SELECT clause, rather than as a general rule that adds missing attributes to a Filter.
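
For reference, the HAVING case described there looks roughly like this (the table and columns are made up for illustration):

Seq(("sales", 1), ("eng", 2)).toDF("dept", "id").createOrReplaceTempView("emp")
// `dept` is a grouping column that does not appear in the SELECT list,
// but the HAVING clause can still reference it.
sql("SELECT count(*) FROM emp GROUP BY dept HAVING dept = 'sales'").show()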

@viirya
Member Author

viirya commented May 7, 2017

@cloud-fan This rule could make the query work:

Seq(1).toDF("c1").createOrReplaceTempView("onerow")
sql(
  """
    | select 1
    |        from   (select 1 from onerow t2 LIMIT 1)
    |        where  t2.c1=1""".stripMargin)

But the where condition should not be able to refer to t2.c1, which is only available in the inner scope.
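
One way to see this is to print the analyzed plan, which shows how t2.c1 gets pulled up out of the subquery to satisfy the outer filter (sketch only; the exact plan text depends on the Spark version):

// Same query as above; explain(true) prints the analyzed and optimized plans.
sql(
  """
    | select 1
    |        from   (select 1 from onerow t2 LIMIT 1)
    |        where  t2.c1=1""".stripMargin).explain(true)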

@viirya viirya changed the title [SPARK-20612][SQL][WIP] Throw exception when there is unresolvable attributes in Filter [SPARK-20612][SQL] Throw exception when there is unresolvable attributes in Filter May 7, 2017
@SparkQA

SparkQA commented May 7, 2017

Test build #76538 has finished for PR 17874 at commit f19976a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* projection, so that they will be available during sorting. Another projection is added to
* remove these attributes after sorting.
*
* The HAVING clause could also used a grouping columns that is not presented in the SELECT.
Member

This is by design.

Member

For example,

select type, avg (price)
from titles
group by type
having sum (total_sales) > 10000

This example is copied from Sybase ASE. I believe this is part of Transact-SQL.

Member Author

@viirya viirya May 7, 2017

> This is by design.

For the HAVING clause that uses a grouping column not present in the SELECT, yes.

For other general cases, I doubt it.

We have another rule that handles this case (HAVING clauses with grouping columns). That is why the tests still pass after this rule is removed. The above query also works without this rule.

Member

@gatorsmile gatorsmile May 7, 2017

Since we introduced this by accident, I do not think we can remove it now. It could break applications that are built on it. cc @rxin @cloud-fan @marmbrus

@cloud-fan
Contributor

In Postgres, select a from t where b > 0 works, so I think it's reasonable for df.select("y").where("x=1") to work in Spark.

Seq(1).toDF("c1").createOrReplaceTempView("onerow")
sql(
  """
    | select 1
    |        from   (select 1 from onerow t2 LIMIT 1)
    |        where  t2.c1=1""".stripMargin)

This one we should not support; we should not add missing attributes through subqueries.

@viirya
Member Author

viirya commented May 8, 2017

select a from t where b > 0 works. However, it can be seen logically as:

Project [a]
  Filter [b > 0]
    Relation t [a, b]

It seems to me Spark also parses the above SQL query this way.

There is an order of evaluation in SQL systems, e.g. in MySQL:

select a from test where b > 2;   -- works: `where` is evaluated before `select`.
select a from test having b > 2;  -- does not work: `having` is evaluated after `select`.

df.select("y").where("x=1") sematically asks a projection of just y attribute before filtering. It seems to me that it is different with the SQL query.

@viirya
Member Author

viirya commented May 8, 2017

Maybe another point of view is that we can split df.select("y").where("x=1") into two different DataFrames:

val onlyY = df.select("y")  // The schema of onlyY is just "y" attribute
onlyY.where("x=1") // Then we filter on a non-existing attribute

val model = new FPGrowth().setMinSupport(0.7).fit(dataset)
val prediction = model.transform(df)
assert(prediction.select("prediction").where("id=3").first().getSeq[String](0).isEmpty)
assert(prediction.where("id=3").select("prediction").first().getSeq[String](0).isEmpty)
Contributor

I'm worried that existing Spark applications may already use this pattern in their code, so whether it's a bug or not, it seems to be a feature now and we can't break it...

@viirya
Member Author

viirya commented May 8, 2017

@cloud-fan Since you are all concerned about breaking existing applications, I'll close this. But I think we should not add missing attributes through subqueries like I showed above. I'll create another PR to fix it. What do you think?

@cloud-fan
Contributor

Yeah, that example looks very weird and we should fix it, thanks!

@viirya viirya closed this May 10, 2017
@viirya viirya deleted the SPARK-20612 branch December 27, 2023 18:20