
Conversation

@viirya
Member

@viirya viirya commented May 5, 2017

What changes were proposed in this pull request?

We have a rule in Analyzer that adds attributes referenced in a Filter but missing from its child plan into that child plan. It makes the following code work:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
df.select("y").where("x=1")

It should throw an analysis exception instead of implicitly adding the missing attributes into the underlying plan.
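
A minimal sketch of the intended behavior after this change (the ScalaTest-style intercept is just for illustration, not the actual test added in this PR):

import org.apache.spark.sql.AnalysisException

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "y")
// Filtering on a column that was already projected away should now fail
// analysis instead of being resolved by silently re-adding "x" to the child.
intercept[AnalysisException] {
  df.select("y").where("x=1")
}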

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@cloud-fan
Contributor

Is it a bug? IIRC this is intentional, so that the DataFrame behavior is consistent with SQL.

@SparkQA

SparkQA commented May 5, 2017

Test build #76495 has finished for PR 17874 at commit 010d046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 6, 2017

The rule was added by #12235. From the description and code comment, it should only be used for a HAVING clause that accesses a grouping column not present in the SELECT clause, rather than as a general rule that adds missing attributes to a Filter.
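
For reference, the HAVING case described there looks roughly like this (the table and columns are made up for illustration):

Seq(("sales", 1), ("eng", 2)).toDF("dept", "id").createOrReplaceTempView("emp")
// `dept` is a grouping column that does not appear in the SELECT list,
// but the HAVING clause can still reference it.
sql("SELECT count(*) FROM emp GROUP BY dept HAVING dept = 'sales'").show()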

@viirya
Member Author

viirya commented May 7, 2017

@cloud-fan This rule could make the query work:

Seq(1).toDF("c1").createOrReplaceTempView("onerow")
sql(
  """
    | select 1
    |        from   (select 1 from onerow t2 LIMIT 1)
    |        where  t2.c1=1""".stripMargin)

But the where condition should not be able to refer to t2.c1, which is only available in the inner scope.
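
One way to see this is to print the analyzed plan, which shows how t2.c1 gets pulled up out of the subquery to satisfy the outer filter (sketch only; the exact plan text depends on the Spark version):

// Same query as above; explain(true) prints the analyzed and optimized plans.
sql(
  """
    | select 1
    |        from   (select 1 from onerow t2 LIMIT 1)
    |        where  t2.c1=1""".stripMargin).explain(true)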

@viirya viirya changed the title [SPARK-20612][SQL][WIP] Throw exception when there is unresolvable attributes in Filter [SPARK-20612][SQL] Throw exception when there is unresolvable attributes in Filter May 7, 2017
@SparkQA

SparkQA commented May 7, 2017

Test build #76538 has finished for PR 17874 at commit f19976a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* projection, so that they will be available during sorting. Another projection is added to
* remove these attributes after sorting.
*
* The HAVING clause could also used a grouping columns that is not presented in the SELECT.
Member

This is by design.

Member

For example,

select type, avg (price)
from titles
group by type
having sum (total_sales) > 10000

This example is copied from Sybase ASE. I believe this is part of Transact-SQL.

Member Author

@viirya viirya May 7, 2017

> This is by design.

For the HAVING clause that uses a grouping column not present in the SELECT, yes.

For other general cases, I doubt it.

We have another rule that handles this case (HAVING clauses with grouping columns). That is why the tests still pass after this rule is removed. The above query also works without this rule.

Member

@gatorsmile gatorsmile May 7, 2017

Since we introduced this by accident, I do not think we can remove it now. It could break applications that are built on it. cc @rxin @cloud-fan @marmbrus

@cloud-fan
Contributor

In Postgres, select a from t where b > 0 works, so I think it's reasonable for df.select("y").where("x=1") to work in Spark.

Seq(1).toDF("c1").createOrReplaceTempView("onerow")
sql(
  """
    | select 1
    |        from   (select 1 from onerow t2 LIMIT 1)
    |        where  t2.c1=1""".stripMargin)

This one we should not support; we should not add missing attributes through subqueries.

@viirya
Member Author

viirya commented May 8, 2017

select a from t where b > 0 works. However, it can be seen logically as:

Project [a]
  Filter [b > 0]
    Relation t [a, b]

It seems to me Spark also parses the above SQL query this way.

There is an order of evaluation in SQL systems, e.g. in MySQL:

select a from test where b > 2;   -- works: `where` is evaluated before `select`.
select a from test having b > 2;  -- does not work: `having` is evaluated after `select`.

df.select("y").where("x=1") sematically asks a projection of just y attribute before filtering. It seems to me that it is different with the SQL query.

@viirya
Member Author

viirya commented May 8, 2017

Maybe another point of view is that we can split df.select("y").where("x=1") into two different DataFrames:

val onlyY = df.select("y")  // The schema of onlyY is just "y" attribute
onlyY.where("x=1") // Then we filter on a non-existing attribute

val model = new FPGrowth().setMinSupport(0.7).fit(dataset)
val prediction = model.transform(df)
assert(prediction.select("prediction").where("id=3").first().getSeq[String](0).isEmpty)
assert(prediction.where("id=3").select("prediction").first().getSeq[String](0).isEmpty)
Contributor

I'm worried that existing Spark applications may already use this pattern in their code, so whether it's a bug or not, it seems to be a feature now and we can't break it...

@viirya
Member Author

viirya commented May 8, 2017

@cloud-fan Since you are all concerned about breaking existing applications, I'll close this. But I think we should not add missing attributes through subqueries like I showed above. I'll create another PR to fix it. What do you think?

@cloud-fan
Contributor

Yeah, that example looks very weird and we should fix it, thanks!

@viirya viirya closed this May 10, 2017
@viirya viirya deleted the SPARK-20612 branch December 27, 2023 18:20