[SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row #40755
Closed
8 commits:
58e779d add test (kings129)
2f43ef4 local var (kings129)
1809a4d add back null check in children deserializer (kings129)
289a546 fix scala style check (kings129)
413b632 remove extra space (kings129)
6912c3b use exiting nullSafe (kings129)
2e59114 better naming (kings129)
2372d49 keep deserializer outermost null check (kings129)
Changes from 1 commit: 2f43ef41b6c6f3446db8fefb6cbe0176ed5b1eda (local var)
This effectively pushes the null check down to the children deserializers. Why is the serializer fine as is?
This change is intended to produce a deserializer of the form newinstance(class scala.Tuple*) that can evaluate to a single null value. This matches the behavior before the commit that introduced the regression.
Regarding the serializer: in the new unit test added in this pull request, when the tuple is not null, a named_struct is created for each element, and null is handled there.
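To make the difference concrete, here is a minimal plain-Scala sketch of the idea. It is not Spark's actual deserializer code: every name in it (NullCheckSketch, field, childNoCheck, childWithCheck, pair) is hypothetical, and an Array[Any] stands in for a struct column value, with null meaning the outer join found no matching row. It assumes that reading a field from a null struct yields null.

```scala
object NullCheckSketch {
  // Hypothetical stand-in for a struct column value; null means "no matching row".
  type Struct = Array[Any]

  // Assumption: reading a field from a null struct yields a null field.
  def field(s: Struct, i: Int): Any = if (s == null) null else s(i)

  // Child deserializer WITHOUT a null check: it always builds a tuple,
  // so a null struct turns into a tuple of nulls.
  def childNoCheck(s: Struct): (String, Integer) =
    (field(s, 0).asInstanceOf[String], field(s, 1).asInstanceOf[Integer])

  // Child deserializer WITH a null check: a null struct stays a single null.
  def childWithCheck(s: Struct): (String, Integer) =
    if (s == null) null else childNoCheck(s)

  // Deserializer for the joined pair, parameterised by the child deserializer.
  def pair(
      left: Struct,
      right: Struct,
      child: Struct => (String, Integer)): ((String, Integer), (String, Integer)) =
    (child(left), child(right))

  def main(args: Array[String]): Unit = {
    val matched: Struct = Array[Any]("a", Integer.valueOf(1))
    val unmatched: Struct = null // right side of a left outer join with no match

    println(pair(matched, unmatched, childNoCheck))   // right side is a tuple of nulls
    println(pair(matched, unmatched, childWithCheck)) // right side is a single null
  }
}
```

In this model, only the variant with the null check on the children returns a null value for the unmatched side, which is the behavior the PR title asks for.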
@cloud-fan does my comment answer your question? PTAL, thanks!
It looks correct to me to add the null check for the children deserializers. But I don't quite understand why this PR removes the outermost null check. After looking at the code, I think it doesn't matter, as the outermost null check will be removed anyway: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L274
Since this is unrelated to this PR, let's not touch it. If you do want to fix it (adding a null check and then removing it later is useless), let's fix the serializer as well.
@cloud-fan, thanks for the explanation! You're right; it doesn't matter whether the outermost null check is kept. (The null check for the deserializer was also added in the refactor commit.)
I also prefer making minimal changes to fix the target issue. I added back the outermost null check for the deserializer.