-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-45685][CORE][SQL] Use LazyList instead of Stream
#43563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test first, will update pr description later |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @LuciferYang .
Merged to master for Apache Spark 4.0.0.
|
Thanks @dongjoon-hyun |
beliefer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM later.
|
Thanks @beliefer ~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this PR may have introduced a rare performance / StackOverflowError regression:
Even though Stream is deprecated in 2.13, it is not removed and thus is is possible that some parts of Spark / Catalyst (or third-party code) might continue to pass around Stream instances, but at call sites like
val inputVars = inputVarsCandidate match {
case stream: Stream[ExprCode] => stream.force
case other => other
}this PR has replaced the pattern matching with
val inputVars = inputVarsCandidate match {
case stream: LazyList[ExprCode] => stream.force
case other => other
}instead of handling both Stream and LazyList, as in
val inputVars = inputVarsCandidate match {
case stream: Stream[ExprCode] => stream.force
case stream: LazyList[ExprCode] => stream.force
case other => other
}Given this, I think that we should make a followup patch to update all of the modified .force call sites to perform the forcing for both LazyList and Stream, since otherwise we would be losing the eager materialization for Streams that happen to flow to these call sites.
@LuciferYang @dongjoon-hyun , WDYT?
|
@JoshRosen I have submitted a pr: #46970, please let me know if I have correctly understood your suggestion. If current pr has a significant impact on performance, we can also revert it. |
Sorry, please ignore this question, I should have understood what you mean. |
…t.force` is called ### What changes were proposed in this pull request? Refer to the suggestion of #43563 (review), this pr add handling for Stream where LazyList.force is called ### Why are the changes needed? Even though `Stream` is deprecated in 2.13, it is not _removed_ and thus is is possible that some parts of Spark / Catalyst (or third-party code) might continue to pass around `Stream` instances. Hence, we should restore the call to `Stream.force` where `.force` is called on `LazyList`, to avoid losing the eager materialization for Streams that happen to flow to these call sites. This is also a guarantee of compatibility. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add some new tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46970 from LuciferYang/SPARK-45685-FOLLOWUP. Authored-by: yangjie01 <[email protected]> Signed-off-by: yangjie01 <[email protected]>
What changes were proposed in this pull request?
This pr change to use
LazyListinstead ofStreamdue toStreamhas been marked as deprecated after Scala 2.13.0.class Streamobject Streamtype Streamand value StreamtoStreamin traitIterableOnceOpsWhy are the changes needed?
Clean up deprecated Scala API usage.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Pass GitHub Acitons
Was this patch authored or co-authored using generative AI tooling?
No