@LuciferYang
Contributor

@LuciferYang LuciferYang commented Oct 27, 2023

What changes were proposed in this pull request?

This PR replaces uses of Stream with LazyList, because Stream has been deprecated since Scala 2.13.0. The affected declarations are:

  • class Stream
@deprecated("Use LazyList (which is fully lazy) instead of Stream (which has a lazy tail only)", "2.13.0")
@SerialVersionUID(3L)
sealed abstract class Stream[+A] extends AbstractSeq[A]
  with LinearSeq[A]
  with LinearSeqOps[A, Stream, Stream[A]]
  with IterableFactoryDefaults[A, Stream]
  with Serializable {
  ...
  @deprecated("The `append` operation has been renamed `lazyAppendedAll`", "2.13.0")
  @inline final def append[B >: A](rest: => IterableOnce[B]): Stream[B] = lazyAppendedAll(rest)
  • object Stream
@deprecated("Use LazyList (which is fully lazy) instead of Stream (which has a lazy tail only)", "2.13.0")
@SerialVersionUID(3L)
object Stream extends SeqFactory[Stream] {
  • type Stream and value Stream
@deprecated("Use LazyList instead of Stream", "2.13.0")
type Stream[+A] = scala.collection.immutable.Stream[A]
@deprecated("Use LazyList instead of Stream", "2.13.0")
val Stream = scala.collection.immutable.Stream
  • method toStream in trait IterableOnceOps
@deprecated("Use .to(LazyList) instead of .toStream", "2.13.0")
@inline final def toStream: immutable.Stream[A] = to(immutable.Stream)
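
As a rough illustration of the replacements named in the deprecation messages above, the sketch below contrasts the old and new calls. The object name `StreamMigrationSketch` and its methods are hypothetical and are not taken from the Spark code base; the collection calls themselves (`toStream`, `.to(LazyList)`, `lazyAppendedAll`) are the standard Scala 2.13 ones.

```scala
import scala.annotation.nowarn

// Hypothetical demo object; not part of Spark.
object StreamMigrationSketch {

  // Before: `toStream` is deprecated since 2.13.0. A Stream has a lazy tail only,
  // so mapping it applies `f` to the head right away.
  @nowarn("cat=deprecation")
  def oldStyle(f: Int => Int): Seq[Int] = (1 to 3).toStream.map(f)

  // After: `.to(LazyList)` is the suggested replacement. A LazyList is fully lazy,
  // so `f` is not applied until elements are actually demanded.
  def newStyle(f: Int => Int): LazyList[Int] = (1 to 3).to(LazyList).map(f)

  // `append` was renamed to `lazyAppendedAll`; LazyList offers the same method.
  def appended: LazyList[Int] = LazyList(1, 2).lazyAppendedAll(List(3))
}
```

The difference in laziness is what makes the later discussion about `.force` relevant: code that relied on a `Stream` being forced needs the same treatment once a `LazyList` arrives instead.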

Why are the changes needed?

Clean up deprecated Scala API usage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang LuciferYang marked this pull request as draft October 27, 2023 14:56
@LuciferYang
Contributor Author

Testing first; I will update the PR description later.

@LuciferYang LuciferYang marked this pull request as ready for review October 29, 2023 12:56
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @LuciferYang .
Merged to master for Apache Spark 4.0.0.

@LuciferYang
Contributor Author

Thanks @dongjoon-hyun

Contributor

@beliefer beliefer left a comment

Late LGTM.

@LuciferYang
Contributor Author

Thanks @beliefer ~

Contributor

@JoshRosen JoshRosen left a comment

I think that this PR may have introduced a rare performance / StackOverflowError regression:

Even though Stream is deprecated in 2.13, it is not removed, and thus it is possible that some parts of Spark / Catalyst (or third-party code) might continue to pass around Stream instances, but at call sites like

val inputVars = inputVarsCandidate match {
  case stream: Stream[ExprCode] => stream.force
  case other => other
}

this PR has replaced the pattern matching with

val inputVars = inputVarsCandidate match {
  case stream: LazyList[ExprCode] => stream.force
  case other => other
}

instead of handling both Stream and LazyList, as in

val inputVars = inputVarsCandidate match {
  case stream: Stream[ExprCode] => stream.force
  case stream: LazyList[ExprCode] => stream.force
  case other => other
}

Given this, I think that we should make a followup patch to update all of the modified .force call sites to perform the forcing for both LazyList and Stream, since otherwise we would be losing the eager materialization for Streams that happen to flow to these call sites.
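
To make the concern concrete, here is a minimal, self-contained sketch of a match that forces both collection types. The object `ForceBothSketch` and the helpers `materialize` and `legacy` are hypothetical names used only for illustration; this is not the actual follow-up patch.

```scala
import scala.annotation.nowarn

// Hypothetical demo object; not Spark's actual code.
object ForceBothSketch {

  // Eagerly materializes the candidate if it is a Stream or a LazyList,
  // and returns any other Seq unchanged.
  @nowarn("cat=deprecation")
  def materialize[A](candidate: Seq[A]): Seq[A] = candidate match {
    case stream: Stream[A]     => stream.force   // legacy callers may still pass a Stream
    case lazyList: LazyList[A] => lazyList.force // new code paths produce a LazyList
    case other                 => other
  }

  // Builds a legacy Stream behind a Seq type, as third-party code might.
  @nowarn("cat=deprecation")
  def legacy[A](xs: Seq[A]): Seq[A] = xs.toStream
}
```

With only the `LazyList` case present, a `Stream` would fall through to `case other` and stay partially unevaluated, which is the loss of eager materialization described above.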

@LuciferYang @dongjoon-hyun , WDYT?

@LuciferYang
Contributor Author

@JoshRosen I have submitted a PR, #46970. Please let me know whether I have correctly understood your suggestion.

If the current PR has a significant impact on performance, we can also revert it.

@LuciferYang
Contributor Author

LuciferYang commented Jun 13, 2024

I think that this PR may have introduced a rare performance / StackOverflowError regression:

I would like to understand more: is this issue due to the design of LazyList itself? Should we submit an issue to the Scala community?

Sorry, please ignore this question; I think I now understand what you mean.

LuciferYang added a commit that referenced this pull request Jun 14, 2024
…t.force` is called

### What changes were proposed in this pull request?
Following the suggestion in #43563 (review), this PR adds handling for `Stream` wherever `LazyList.force` is called.

### Why are the changes needed?
Even though `Stream` is deprecated in 2.13, it is not _removed_, and thus it is possible that some parts of Spark / Catalyst (or third-party code) might continue to pass around `Stream` instances. Hence, we should restore the call to `Stream.force` wherever `.force` is called on `LazyList`, to avoid losing the eager materialization for Streams that happen to flow to these call sites. This also preserves compatibility.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add some new tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46970 from LuciferYang/SPARK-45685-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>