[SPARK-16818] Exchange reuse incorrectly reuses scans over different sets of partitions #14425
def getPlan(df: DataFrame): SparkPlan = {
  df.queryExecution.executedPlan
}
assert(getPlan(df.where("id = 2")).sameResult(getPlan(df.where("id = 2"))))
did you verify this would fail without your patch?
LGTM (assuming the test case would fail without the fix)
Yep, both fail prior to the fix. (In reply to Reynold Xin, Sat, Jul 30, 2016, 3:32 PM.)
Test build #63047 has finished for PR 14425 at commit
Merging in master/2.0.
@ericl there is a conflict with branch-2.0. Can you create a pull request for branch-2.0?
…sets of partitions
This fixes a bug where the file scan operator does not take into account partition pruning in its implementation of `sameResult()`. As a result, executions may be incorrect on self-joins over the same base file relation.
The patch here is minimal, but we should reconsider relying on `metadata` for implementing `sameResult()` in the future, as string representations may not be uniquely identifying.
cc rxin
Unit tests.
Author: Eric Liang
Closes apache#14425 from ericl/spark-16818.
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
Done, see #14427
…sets of partitions
#14425 rebased for branch-2.0
Author: Eric Liang
Closes #14427 from ericl/spark-16818-br-2.
What changes were proposed in this pull request?
This fixes a bug where the file scan operator does not take into account partition pruning in its implementation of `sameResult()`. As a result, executions may be incorrect on self-joins over the same base file relation.
The patch here is minimal, but we should reconsider relying on `metadata` for implementing `sameResult()` in the future, as string representations may not be uniquely identifying.
cc @rxin
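To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is a toy model only: `ToyFileScan`, its fields, and both comparison methods are hypothetical names made up for this example and are not Spark's actual operator classes or the code changed by this patch. It shows how a comparison that ignores partition pruning treats two scans over different partitions as interchangeable, which is the kind of mismatch that could lead exchange reuse to pick the wrong scan in a self-join.

```scala
// Toy model only: these classes are hypothetical and are NOT Spark's real operators.
final case class ToyFileScan(
    relation: String,                 // base file relation being read
    metadata: Map[String, String],    // descriptive strings, e.g. format and schema
    partitionFilters: Set[String]) {  // pruning predicates applied to partitions

  // Buggy comparison: looks only at the relation and its metadata,
  // so it ignores which partitions are actually scanned.
  def sameResultIgnoringPruning(other: ToyFileScan): Boolean =
    relation == other.relation && metadata == other.metadata

  // Safer comparison: two scans match only if they also prune identically.
  def sameResultWithPruning(other: ToyFileScan): Boolean =
    sameResultIgnoringPruning(other) && partitionFilters == other.partitionFilters
}

object ToyFileScanDemo {
  // Assumed example metadata; the keys and values are illustrative.
  val meta: Map[String, String] =
    Map("Format" -> "ParquetFormat", "ReadSchema" -> "struct<id:bigint>")

  // Two sides of a self-join over the same relation, pruned to different partitions.
  val left: ToyFileScan = ToyFileScan("t", meta, Set("id = 2"))
  val right: ToyFileScan = ToyFileScan("t", meta, Set("id = 3"))
}
```

In this sketch, `left.sameResultIgnoringPruning(right)` is true even though the two scans read different partitions, while `left.sameResultWithPruning(right)` is false. How the real patch encodes the pruning information is not shown here.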
How was this patch tested?
Unit tests.