-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-42896][SQL][PYTHON] Make mapInPandas / mapInArrow support barrier mode execution
#40520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 1 commit
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
4de9d07
init
WeichenXu123 f612bcb
update
WeichenXu123 e95fd95
update
WeichenXu123 aa7862f
update
WeichenXu123 9be4cad
update
WeichenXu123 9ebb3f2
Merge branch 'master' into barrier-udf
WeichenXu123 85b840b
fix test
WeichenXu123 66f13cb
update format
WeichenXu123 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next
Next commit
init
Signed-off-by: Weichen Xu <[email protected]>
- Loading branch information
commit 4de9d0791d7de4cdc042fd9f42c510f954c628da
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would require an implementation much complicated than this actually. The current implementation might work for now because these specific physical plans don't move around much for now but the implementation is flaky because the physical plans can change via Catalyst Optimizer (e.g., predicate pushdown) but the barrier execution mode requires that the barrier is created exactly the location where it's invoked.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean pushdown like this ?
to
?
I don't think we should allow pushdown in this case.
MapInPandasexecutes arbitrary user code. Can we modify optimizer code to prevent it change the plan when it finds "is_barrier" set ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feature
mapInPandas / mapInArrowsupporting barrier mode is only for ML use case, we don't need to complicate it. We only need to support such scenario:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that the optimizers can might switch the physical plan order in the future, and that's why the implementation is flaky. Now this works because we don't switch the order anywhere else.
To have a complete implementation, we should have something like #19873 which brought a lot of side effect, and it got reverted in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is only for internal purpose, let's probably don't expose
is_barrier: bool = Falseto the API surface at least.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simply the implementation, we can implement a
barrierMapInPandasAndCollectinstead, and define a execution plan stage likeBarrierMapInPandasAndCollectExecThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It should be public API. Third-party lib such as xgboost also need to use it.