Skip to content

Conversation

@gatorsmile
Copy link
Member

Spark SQL is using expression ID to identify the column references. For self joins, these IDs might not be unique. When resolving the attributeReference's ambiguity caused by self joins, the current solution only handles the conflicting attributes. However, this does not work when the join conditions use the column names that appear in both analyzed dataFrames, since the the columns in join conditions are analyzed before resolving the ambiguity of conflicting attributes. Currently, we did not update the search-condition during ambiguity resolution of attributeReference. When generating the new expression IDs in the right tree, we must update the corresponding columns' expression ID in search condition. This part is missing now.

Here, I am trying to propose a solution to resolve the above issue. When analyzing the columns in the join conditions, we record the dataFrame hashCode of the search-condition columns. By using the information, we can determine which columns are from the right tree, and then update their expression IDs when resolving the ambiguity of conflicting attributes.

When designing this PR, I am trying to avoid introducing a lot of code changes, and thus, I just use quantifiers to record this information, if and only if necessary.

Ideally, I think each column reference always needs to maintain a dedicated identifier for identifying its source dataFrame. The ideal solution requires a lot of code changes, but this can help us further optimize the plan in the future.

Thanks for any suggestion!

@gatorsmile
Copy link
Member Author

Since this solution requires adding quantifier comparison into the equation of attributeReferences, this will fail a couple test cases in expand.

We have already identified the bugs in the expand and submitted pull requests to resolve this issue. #9216

@SparkQA
Copy link

SparkQA commented Nov 8, 2015

Test build #45319 has finished for PR 9548 at commit 7d04713.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

To fix these failed cases, I will move the dataFrame's hashCode to the Column class as a dedicated field, instead of directly putting the values to quantifiers.

@gatorsmile
Copy link
Member Author

I can't fix the problem without a major code change. The current design of dataFrame APIs has a fundamental problem. When using column references, we might hit various strange issues if the dataFrame has the columns with the same name and expression id. Note that this might occur even if we do not have self joins.

For example, in the following code,

    val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
    val df2 = Seq((1, 4, 0), (2, 1, 0)).toDF("keyCol1", "keyCol3", "keyColToDrop")
    val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1"))

    val col = df3("keyColToDrop")
    val df = df2.drop(col)
    df.printSchema() 

Above, we can use a column reference of df3 to drop the column in df2. That does not make sense, right?

In each column reference, we have to know the data source.

@marmbrus @rxin @liancheng @yhuai @cloud-fan
Should I propose a solution to fix this problem?

@marmbrus
Copy link
Contributor

marmbrus commented Nov 9, 2015

That particular example does not really seem like a problem to me. Its the same column logically that you are dropping.

What if we make the change in the Column API only, instead of trying to change AttributeReference. Perhaps we can assign a globally unique ID to any Dataframe (similar to expression ID). This can be optionally added to any column that is created directly from a dataframe.

@gatorsmile
Copy link
Member Author

@marmbrus Thank you for your suggestions!

That is also like my initial idea. I did a try last night. Unfortunately, I hit a problem when adding such a field to Column API.

In the current design, the class Column corresponds to the class Expression, which includes both AttributeReference and the other types. For Column, it makes sense to have such a dataFrame identifier. However, when Column is generated from the binary expression types (e.g., gt), it could have more than one dataFrame identifiers. Does that sound good to you?

When implementing the idea, it becomes more difficult. For example, in the following binary operators,

  def === (other: Any): Column = {
    val right = lit(other).expr
    EqualTo(expr, right)
  }

EqualTo is an Expression. expr and right are not Columns. Thus, when accessing the Column generated from ===, we are unable to know the dataframe sources of expr and right if we do not change AttributeReference. If we ignore the binary operators, we are unable to resolve the AttributeReference ambiguity caused by self joins. : (

That is why I am thinking this could mean a major code change to DataFrame and Column. Thank you for any further suggestion.

@cloud-fan
Copy link
Contributor

I don't think every Column need to belong to a DataFrame, and it's ok to me to have exactly same Column from different DataFrames, e.g. the keyColToDrop in your example.

About the problem that we resolve right tree of self join but miss the join codition, actually it's a known bug, a workaround is aliasing a name to DataFrames and use $"df.col" in join condition so that it's unresolved while we resoving the right tree.

Making every Column globally unique is a good idea, but adding a DataFrame ID maybe too much because self-join is a special case and the only case that introduce ambiguity. How about we create new Columns with new expression IDs only when calling DataFrame.as?

@gatorsmile
Copy link
Member Author

@cloud-fan Before discussing the solution details, let us first talk about the design issues.

IMO, the DataFrame is a query language, kind of a dialect of SQL. Or, maybe, SQL is a dialect of DataFrame. We need to formalize it and clearly define the concepts of each major classes like DataFrame and Column. If Column represents a concept independent of DataFrame, could you define what it is? If one Column with the same ID can appear in different DataFrame, how to enforce such a "referential integrity" between different DataFrame? If two Column with different ID could represent the same entity, should we keep such a relation for generating a better physical plan?

In the current implementation, each Column actually corresponds to an expression in logical plans, but we are unable to apply an expression above Column instances to generate a new expression. So far, Column is kind of a wrapper of Expression, but it is not a subclass of TreeNode.

When more components are built on DataFrame to access and operate, we have to carefully think about this problem. If possible, I think we need to resolve it in the release of Spark 2.0.

Will answer your design suggestion in a separate post.

@gatorsmile
Copy link
Member Author

@cloud-fan So far, we do not have an easy fix, but I believe we should never return a wrong result for self join.

Let me post the test case I added. This test case will return an incorrect result without any exception:

test("[SPARK-10838] self join - conflicting attributes in condition - incorrect result 2") {
   val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
   val df2 = Seq((1, 4), (2, 1)).toDF("keyCol1", "keyCol3")
   val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1")).select(df1("keyCol1"), $"keyCol3")
   checkAnswer(
     df3.join(df1, df3("keyCol3") === df1("keyCol1") && df1("keyCol1") === df3("keyCol3")),
     Row(2, 1, 1, 3) :: Nil)
 }

Before resolving this problem, what we can do it is to detect it and let customers use the workaround you mentioned. The detection condition is simple. The incorrect result could happen when the conflicting attributes contain the AttributeReference that appear in the join condition. That will be done in Analyzer.

Do you agree @cloud-fan @marmbrus ?

If OK, I will submit another PR for detecting it and issuing an exception with a meaningful message to users.

@rxin
Copy link
Contributor

rxin commented Jan 12, 2016

@gatorsmile we will revisit this in the future. Do you mind closing the pull request for now?

@gatorsmile
Copy link
Member Author

Ok, let me close it. Thank you!

@gatorsmile gatorsmile closed this Jan 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants