[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins #9548

gatorsmile · 2015-11-08T22:09:42Z

Spark SQL is using expression ID to identify the column references. For self joins, these IDs might not be unique. When resolving the attributeReference's ambiguity caused by self joins, the current solution only handles the conflicting attributes. However, this does not work when the join conditions use the column names that appear in both analyzed dataFrames, since the the columns in join conditions are analyzed before resolving the ambiguity of conflicting attributes. Currently, we did not update the search-condition during ambiguity resolution of attributeReference. When generating the new expression IDs in the right tree, we must update the corresponding columns' expression ID in search condition. This part is missing now.

Here, I am trying to propose a solution to resolve the above issue. When analyzing the columns in the join conditions, we record the dataFrame hashCode of the search-condition columns. By using the information, we can determine which columns are from the right tree, and then update their expression IDs when resolving the ambiguity of conflicting attributes.

When designing this PR, I am trying to avoid introducing a lot of code changes, and thus, I just use quantifiers to record this information, if and only if necessary.

Ideally, I think each column reference always needs to maintain a dedicated identifier for identifying its source dataFrame. The ideal solution requires a lot of code changes, but this can help us further optimize the plan in the future.

Thanks for any suggestion!

gatorsmile · 2015-11-08T22:18:11Z

Since this solution requires adding quantifier comparison into the equation of attributeReferences, this will fail a couple test cases in expand.

We have already identified the bugs in the expand and submitted pull requests to resolve this issue. #9216

SparkQA · 2015-11-08T22:42:32Z

Test build #45319 has finished for PR 9548 at commit 7d04713.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2015-11-09T03:50:14Z

To fix these failed cases, I will move the dataFrame's hashCode to the Column class as a dedicated field, instead of directly putting the values to quantifiers.

gatorsmile · 2015-11-09T20:28:54Z

I can't fix the problem without a major code change. The current design of dataFrame APIs has a fundamental problem. When using column references, we might hit various strange issues if the dataFrame has the columns with the same name and expression id. Note that this might occur even if we do not have self joins.

For example, in the following code,

    val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
    val df2 = Seq((1, 4, 0), (2, 1, 0)).toDF("keyCol1", "keyCol3", "keyColToDrop")
    val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1"))

    val col = df3("keyColToDrop")
    val df = df2.drop(col)
    df.printSchema()

Above, we can use a column reference of df3 to drop the column in df2. That does not make sense, right?

In each column reference, we have to know the data source.

@marmbrus @rxin @liancheng @yhuai @cloud-fan
Should I propose a solution to fix this problem?

marmbrus · 2015-11-09T22:20:35Z

That particular example does not really seem like a problem to me. Its the same column logically that you are dropping.

What if we make the change in the Column API only, instead of trying to change AttributeReference. Perhaps we can assign a globally unique ID to any Dataframe (similar to expression ID). This can be optionally added to any column that is created directly from a dataframe.

gatorsmile · 2015-11-09T23:01:35Z

@marmbrus Thank you for your suggestions!

That is also like my initial idea. I did a try last night. Unfortunately, I hit a problem when adding such a field to Column API.

In the current design, the class Column corresponds to the class Expression, which includes both AttributeReference and the other types. For Column, it makes sense to have such a dataFrame identifier. However, when Column is generated from the binary expression types (e.g., gt), it could have more than one dataFrame identifiers. Does that sound good to you?

When implementing the idea, it becomes more difficult. For example, in the following binary operators,

  def === (other: Any): Column = {
    val right = lit(other).expr
    EqualTo(expr, right)
  }

EqualTo is an Expression. expr and right are not Columns. Thus, when accessing the Column generated from ===, we are unable to know the dataframe sources of expr and right if we do not change AttributeReference. If we ignore the binary operators, we are unable to resolve the AttributeReference ambiguity caused by self joins. : (

That is why I am thinking this could mean a major code change to DataFrame and Column. Thank you for any further suggestion.

cloud-fan · 2015-11-11T10:04:45Z

I don't think every Column need to belong to a DataFrame, and it's ok to me to have exactly same Column from different DataFrames, e.g. the keyColToDrop in your example.

About the problem that we resolve right tree of self join but miss the join codition, actually it's a known bug, a workaround is aliasing a name to DataFrames and use $"df.col" in join condition so that it's unresolved while we resoving the right tree.

Making every Column globally unique is a good idea, but adding a DataFrame ID maybe too much because self-join is a special case and the only case that introduce ambiguity. How about we create new Columns with new expression IDs only when calling DataFrame.as?

gatorsmile · 2015-11-11T16:28:15Z

@cloud-fan Before discussing the solution details, let us first talk about the design issues.

IMO, the DataFrame is a query language, kind of a dialect of SQL. Or, maybe, SQL is a dialect of DataFrame. We need to formalize it and clearly define the concepts of each major classes like DataFrame and Column. If Column represents a concept independent of DataFrame, could you define what it is? If one Column with the same ID can appear in different DataFrame, how to enforce such a "referential integrity" between different DataFrame? If two Column with different ID could represent the same entity, should we keep such a relation for generating a better physical plan?

In the current implementation, each Column actually corresponds to an expression in logical plans, but we are unable to apply an expression above Column instances to generate a new expression. So far, Column is kind of a wrapper of Expression, but it is not a subclass of TreeNode.

When more components are built on DataFrame to access and operate, we have to carefully think about this problem. If possible, I think we need to resolve it in the release of Spark 2.0.

Will answer your design suggestion in a separate post.

gatorsmile · 2015-11-11T21:12:47Z

@cloud-fan So far, we do not have an easy fix, but I believe we should never return a wrong result for self join.

Let me post the test case I added. This test case will return an incorrect result without any exception:

test("[SPARK-10838] self join - conflicting attributes in condition - incorrect result 2") {
   val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
   val df2 = Seq((1, 4), (2, 1)).toDF("keyCol1", "keyCol3")
   val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1")).select(df1("keyCol1"), $"keyCol3")
   checkAnswer(
     df3.join(df1, df3("keyCol3") === df1("keyCol1") && df1("keyCol1") === df3("keyCol3")),
     Row(2, 1, 1, 3) :: Nil)
 }

Before resolving this problem, what we can do it is to detect it and let customers use the workaround you mentioned. The detection condition is simple. The incorrect result could happen when the conflicting attributes contain the AttributeReference that appear in the join condition. That will be done in Analyzer.

Do you agree @cloud-fan @marmbrus ?

If OK, I will submit another PR for detecting it and issuing an exception with a meaningful message to users.

rxin · 2016-01-12T01:47:40Z

@gatorsmile we will revisit this in the future. Do you mind closing the pull request for now?

gatorsmile · 2016-01-12T01:53:11Z

Ok, let me close it. Thank you!

gatorspark added 2 commits November 8, 2015 10:28

Spark-10838

376691a

Merge branch 'selfJoinCondition' into selfJoinConflictingConditions

7d04713

yzhou2001 mentioned this pull request Dec 23, 2015

[SPARK-10838][SPARK-11576][SQL] Incorrect results or exceptions when using self-joins #10450

Closed

gatorsmile closed this Jan 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins #9548

[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins #9548

Uh oh!

gatorsmile commented Nov 8, 2015

Uh oh!

gatorsmile commented Nov 8, 2015

Uh oh!

SparkQA commented Nov 8, 2015

Uh oh!

gatorsmile commented Nov 9, 2015

Uh oh!

gatorsmile commented Nov 9, 2015

Uh oh!

marmbrus commented Nov 9, 2015

Uh oh!

gatorsmile commented Nov 9, 2015

Uh oh!

cloud-fan commented Nov 11, 2015

Uh oh!

gatorsmile commented Nov 11, 2015

Uh oh!

gatorsmile commented Nov 11, 2015

Uh oh!

rxin commented Jan 12, 2016

Uh oh!

gatorsmile commented Jan 12, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins #9548

[SPARK-10838][SPARK-11576][SQL][WIP] Incorrect results or exceptions when using self-joins #9548

Uh oh!

Conversation

gatorsmile commented Nov 8, 2015

Uh oh!

gatorsmile commented Nov 8, 2015

Uh oh!

SparkQA commented Nov 8, 2015

Uh oh!

gatorsmile commented Nov 9, 2015

Uh oh!

gatorsmile commented Nov 9, 2015

Uh oh!

marmbrus commented Nov 9, 2015

Uh oh!

gatorsmile commented Nov 9, 2015

Uh oh!

cloud-fan commented Nov 11, 2015

Uh oh!

gatorsmile commented Nov 11, 2015

Uh oh!

gatorsmile commented Nov 11, 2015

Uh oh!

rxin commented Jan 12, 2016

Uh oh!

gatorsmile commented Jan 12, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants