
Conversation

@srowen
Member

srowen commented Jul 16, 2017

…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure

What changes were proposed in this pull request?

This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.

In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.

It also updates dependencies whose current versions do not yet support Scala 2.12 to the earliest minor release that does. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the latest maintenance releases within the earliest viable minor releases.

  • Scalatest 2.x -> 3.0.3
  • Chill 0.8.0 -> 0.8.4
  • Clapper 1.0.x -> 1.1.2
  • json4s 3.2.x -> 3.4.2
  • Jackson 2.6.x -> 2.7.9 (required by json4s)

This change does not fully enable a Scala 2.12 build:

  • It will also require dropping support for Kafka before 0.10. Easy enough, but not done here yet.
  • It will require recreating SparkILoop and Main for REPL 2.12, which is SPARK-14650. Possible to do here too.

What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.

How was this patch tested?

Existing tests and build. Manually tested with ./dev/change-scala-version.sh 2.12 to verify it compiles, modulo the exceptions above.

@srowen
Member Author

srowen commented Jul 16, 2017

CC @rxin since you asked; CC @JoshRosen
Just a WIP for now, of course, but I wanted to put forward a take on most of the remaining gap we can bridge right now towards 2.12.

@SparkQA

SparkQA commented Jul 16, 2017

Test build #79643 has finished for PR 18645 at commit ef1c221.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2017

Test build #79645 has finished for PR 18645 at commit 1204077.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

These cause a MiMa warning, because they're new methods in a trait that might be extended by users. I'll have to go back and remember whether this is an actual problem, because the trait is providing a default implementation here.

Member Author

I tried a small app that calls RDD.countAsync (which returns a FutureAction) and even implements a custom FutureAction: I compiled it against 2.2.0, then ran it against this build, and it worked. I believe this may be legitimately excluded from MiMa.

Member Author

Changes like this resolve an ambiguity where two overloads of a method exist: one whose signature takes a lambda, and another that takes a single-method trait/interface, either of which could fit. In cases like this I resolved it in favor of implementing a specific listener class.
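
As a rough illustration (simplified stand-ins, not the actual Spark definitions), the shape is something like this:

trait TaskCompletionListener {
  def onTaskCompletion(context: TaskContext): Unit
}

class TaskContext {
  def addTaskCompletionListener(listener: TaskCompletionListener): TaskContext = this
  def addTaskCompletionListener(f: TaskContext => Unit): TaskContext = this
}

object ListenerExample {
  // In 2.12 a bare lambda can satisfy either overload (the listener trait is a SAM
  // type), which can make a call like ctx.addTaskCompletionListener(_ => cleanup())
  // ambiguous in some contexts. Implementing the listener explicitly names the
  // overload and compiles identically on 2.11 and 2.12.
  def register(ctx: TaskContext, cleanup: () => Unit): Unit = {
    ctx.addTaskCompletionListener(new TaskCompletionListener {
      override def onTaskCompletion(context: TaskContext): Unit = cleanup()
    })
  }
}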

Contributor

isn't this a source breaking change?

Contributor

I think that you can avoid source incompatibilities for Scala users by removing the overloads which accept Scala functions and then adding in a package-level implicit conversion to convert from Scala functions back into our own custom trait / interface.

The trickiness here is that you need to preserve binary compatibility on Scala 2.10/2.11, so the removal of the overload needs to be done conditionally so that it only occurs when building with Scala 2.12. Rather than having a separate source tree for 2.12, I'd propose defining the removed overload in a mixin trait which comes from a separate source file and then configure the build to use different versions of that file for 2.10/2.11 and for 2.12.

For details on this proposal, see https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit, a document that I wrote in March 2016 which explores these source incompatibility difficulties.

Applying that proposal here, the idea would be to remove the method

def addTaskCompletionListener(f: (TaskContext) => Unit)

and add a package-level implicit conversion from TaskContext => Unit to TaskCompletionListener, but to do this only in the 2.12 source tree / shim. This approach has some caveats and could potentially impact Java users who are doing weird things (violating the goal that Java Spark code is source and binary compatible with all Scala versions). See the linked doc for a full discussion of this problem.
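
A minimal sketch of that shim idea (simplified stand-ins; in practice the conversion would live in a package object so it's in implicit scope for callers):

import scala.language.implicitConversions

class TaskContext
trait TaskCompletionListener {
  def onTaskCompletion(context: TaskContext): Unit
}

object Scala212Shim {
  // With the (TaskContext => Unit) overload removed in the 2.12 source tree, this
  // conversion lets existing Scala call sites that pass a function keep compiling.
  implicit def functionToCompletionListener(
      f: TaskContext => Unit): TaskCompletionListener =
    new TaskCompletionListener {
      override def onTaskCompletion(context: TaskContext): Unit = f(context)
    }
}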

Member Author

Some stuff like this became necessary due to other dependency changes and changes in their transitive dependencies.

Member Author

Actually this could have been removed along with Scala 2.10 support, and it will be, separately, shortly.

Member Author

In cases like this I resolved the ambiguity by casting the argument instead of implementing an interface, because here the interface overload exists just for Java support.
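
As an illustration with hypothetical names (the real Java-support overloads take interfaces like ForeachPartitionFunction), ascribing the argument to the Scala function type is enough to pick the Scala overload:

// Hypothetical stand-ins: a Scala Function1 overload next to a Java-support SAM overload.
trait VoidFunction[T] { def call(t: T): Unit }

class Pipeline[T](items: Seq[T]) {
  def forEach(f: T => Unit): Unit = items.foreach(f)
  def forEach(f: VoidFunction[T]): Unit = items.foreach(f.call)
}

object CastExample {
  // In a generic context 2.12 may consider both overloads applicable for a bare
  // lambda; ascribing the argument's function type selects the Scala overload and
  // still compiles unchanged on 2.11.
  def printAll[T](p: Pipeline[T]): Unit =
    p.forEach((t => println(t)): T => Unit)
}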

sql/hive/pom.xml Outdated
Member Author

Not strictly related, just found this redundancy while editing. The exact same stanza appears below.

@SparkQA

SparkQA commented Jul 22, 2017

Test build #79851 has finished for PR 18645 at commit fec76be.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

isn't this a source breaking change?

Member Author

srowen, Jul 22, 2017

This is just clarifying to the compiler which of two overloads to invoke, because both actually match and that's an error in 2.12.

It's going to be a problem for end-user apps in exactly the same way: they would need similar changes to work with Spark in Scala 2.12 (and, probably, with other code that has similar characteristics in 2.12).

This is really https://issues.apache.org/jira/browse/SPARK-14643
The thing is, different Scala versions have always been 'source incompatible' too, so that much may not be a big deal.

It would be a big problem if it caused Scala 2.11 apps to change, but this one won't.

Member Author

Actually, no, not the same issue as https://issues.apache.org/jira/browse/SPARK-14643, but similar. Let me ask the question there about how much source compatibility matters across Scala versions, for Java.

I'm not in any event proposing to enable a 2.12 build anytime soon, just seeing if we can get closer without any implications for 2.11.

Contributor

I think this is primarily going to be an issue for end users who want to use an existing source tree to cross-compile for Scala 2.10, 2.11, and 2.12. Thus the pain of the source incompatibility would mostly be felt by library/package maintainers but it can be worked around as long as there's at least some common subset which is source compatible across all of those versions.

A similar ambiguity issue exists for people writing Spark code in Java 8 who use lambda syntax. The number of such users who must also upgrade to Scala 2.12 is probably fairly small, and the number of those users who need to cross-build their Java 8 lambda-syntax Spark code against 2.11 and 2.12 is probably even smaller, so maybe it isn't a huge deal to accept the ambiguity and require those users to do a bit of work when upgrading.

@rxin
Contributor

rxin commented Jul 22, 2017

@srowen You just showed that the Scala 2.12 changes are source breaking, didn't you?

@srowen
Member Author

srowen commented Jul 22, 2017

@rxin (See above for what I think you are referring to.)

I do now see tests failing that diff JSON, but that's probably because the Jackson version had to change. (This is a WIP.) And I need to figure out whether that's effectively a breaking change or not.

@rxin
Contributor

rxin commented Jul 23, 2017

It is still a source-breaking change, and this is why I was saying it would be a lot of work to upgrade to Scala 2.12 without breaking existing source code. For 2.12 we should get rid of the functions that are problematic, using a build profile.

@srowen
Member Author

srowen commented Jul 23, 2017

@rxin it's definitely breaking between Scala 2.11 and Scala 2.12, but not for the Scala 2.11 build. Ideally, we find it's possible to enable a 2.12 build, such that the changes are binary and source compatible within the Scala 2.11 builds. Otherwise, it'd be impossible to merge this before 3.0.0. But if it doesn't affect 2.11 users there's a much better argument for adding the build in a minor release, as Scala versions have never been mutually compatible.

@srowen
Member Author

srowen commented Jul 23, 2017

The JSON error arises because JSON messages now contain an explicit entry for null properties, like "message" : null. This looks like a small behavior change introduced in Jackson 2.7; see https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7 Although it's semantically the same, in theory, we should make the behavior identical.

It looks like several places in the code explicitly configure Jackson to include only NON_NULL properties. I believe that setting now needs to be applied explicitly in more places. Still investigating ...
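
For the record, that kind of configuration looks roughly like this (a sketch only, not the exact Spark code):

import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object JsonMapper {
  // Omit null properties on write, matching the pre-2.7 output the tests expect.
  val mapper: ObjectMapper = new ObjectMapper()
    .registerModule(DefaultScalaModule)
    .setSerializationInclusion(JsonInclude.Include.NON_NULL)
}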

@rxin
Contributor

rxin commented Jul 23, 2017

@srowen I don't agree that we should just break source compatibility here. We have already spent a lot of time doing this in the past and figuring out how to preserve it.

@srowen
Member Author

srowen commented Jul 23, 2017

Are we talking about the same thing? This does not break source compatibility for Scala 2.11. For Scala 2.12, yes, but moving across Scala versions has never been binary-compatible, or necessarily source-compatible. What's different here?

@rxin
Contributor

rxin commented Jul 23, 2017

When users upgrade from 2.11 to 2.12, their app would be broken, wouldn't it?

@srowen
Member Author

srowen commented Jul 23, 2017

That's not an upgrade. spark_2.11 2.2.0 to spark_2.11 2.3.0 is an upgrade, and that can't break, and doesn't with this change.

Or: Scala itself has never been source- or binary-compatible across Scala minor releases, and neither has Spark across Scala versions. Compiling against spark_2.10 and running on spark_2.11 doesn't work, and I don't think cross-compiling even works in all cases. I don't follow why that's an issue here.

In fact, we already know it can't be source compatible, for Java callers: https://issues.apache.org/jira/browse/SPARK-14643 That may well be the argument that this can't happen until 3.0. Check in with @JoshRosen for more context.

However, I am not suggesting creating any Scala 2.12 release at all anytime soon here, so I don't think that's worth debating. The question of whether it's source compatible is moot because this creates no new type of release to be compatible or not. I don't even disagree with someone who argues that Scala 2.12 support can't happen before 3.0.

It does make a lot of the changes that will have to be made for 2.12 though, without disrupting the current 2.11 build. The question is really whether that statement is true -- the Jackson thing is the problem right now.

Why would we bother implementing most of the groundwork for 2.12 without making an official 2.12 build? Generally, to break down the problem and address it incrementally rather than in one huge change. But also to make it possible for a brave soul out there, like maybe the folks at Twitter, to run their own 2.12 build if they wanted to self-support.

More concretely, if this went in to a 2.3.0 release tomorrow, what goes wrong?

@SparkQA

SparkQA commented Jul 23, 2017

Test build #79892 has finished for PR 18645 at commit 59b46ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

JoshRosen commented Jul 23, 2017

Looking at the source compatibility changes made here, I was a little confused about why we didn't need to make many more changes. In principle, it seemed like the ambiguous overload issue affecting that foreachPartition call should have also impacted calls like .filter(_ > 0), so I was originally expecting that we'd have to update hundreds of call sites.

It turns out that the final Scala 2.12 release has a nice feature which resolves this ambiguity in many cases. In the release notes at https://www.scala-lang.org/news/2.12.0/#lambda-syntax-for-sam-types, there's an example where even though there's a potentially ambiguous overload, "overloading resolution selects the one with the Function1 argument type". This is explained in more detail at https://www.scala-lang.org/news/2.12.0/#sam-conversion-in-overloading-resolution.

Quoting:

In Scala 2.12, both methods are applicable, therefore overloading resolution needs to pick the most specific alternative. The specification for type compatibility has been updated to consider SAM conversion, so that the first alternative is more specific.

This explains why we didn't have to update all call sites. However, why did the .foreachPartition example break? I played around with this in Scastie (the Scala pastebin) and I think that I've spotted the problem: all of the cases where we had ambiguity seem to be cases where we're operating on Datasets in a generic way and have a type parameter representing the Dataset's type.

So, this means that

def f(ds: Dataset[Int]): Dataset[Int] = ds.filter(_ != null)

is unambiguous but

def f[T](ds: Dataset[T]): Dataset[T] = ds.filter(_ != null)

fails with a compiler error in Scala 2.12. You can try this out at https://scastie.scala-lang.org/JoshRosen/eBcxUGdORsiOQNULMUnAsw.

I took a look at the sections of the language spec linked in the quote above but it's not immediately clear to me whether this is a compiler bug or expected behavior (I'll have to take some more time to try to understand the spec). My hunch is that it has to do with type inference, e.g. the compiler can't figure out which method to call until it knows the function's type and it can't resolve the function's type until it knows which method it's applying to. That makes intuitive sense to me, but I'm not sure if that's the actual reason.
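
For reference, here is a self-contained sketch of that shape without Spark (hypothetical names mirroring the Function1 vs. Java-function overload pair):

trait FilterFunction[T] { def call(value: T): Boolean }

class Box[T](values: Seq[T]) {
  def filter(f: T => Boolean): Box[T] = new Box(values.filter(f))
  def filter(f: FilterFunction[T]): Box[T] = new Box(values.filter(f.call))
}

object Repro {
  // Compiles on 2.12: the element type is concrete, and overload resolution
  // prefers the Function1 alternative.
  def concrete(b: Box[Int]): Box[Int] = b.filter(_ != 0)

  // Mirrors the generic Dataset case above: uncommenting this should reproduce
  // the ambiguous-overload error on 2.12, while it compiles fine on 2.11.
  // def generic[T](b: Box[T]): Box[T] = b.filter(_ != null)
}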

@srowen
Member Author

srowen commented Jul 26, 2017

I believe the last errors were essentially the same issue discussed in json4s here:
json4s/json4s#227

I suspect that it's down to a change in the effective setting of Jackson when reading or writing JSON: https://fasterxml.github.io/jackson-databind/javadoc/2.7/com/fasterxml/jackson/databind/DeserializationFeature.html#ACCEPT_SINGLE_VALUE_AS_ARRAY or https://fasterxml.github.io/jackson-databind/javadoc/2.7/com/fasterxml/jackson/databind/SerializationFeature.html#WRITE_SINGLE_ELEM_ARRAYS_UNWRAPPED

It doesn't look like the defaults for these settings changed from Jackson 2.6 to 2.7, and I don't see them overridden in json4s, though json4s 3.3 did see some changes that cause it to ignore some Jackson defaults.

In any event I thought it safer to handle both scenarios in the tests where they matter. Let's see if that works, first.
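
Concretely, those two Jackson settings would be toggled roughly like this (a sketch only, not code from this PR):

import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper, SerializationFeature}

object SingleValueArrays {
  // Accept a bare value where an array is expected, and write single-element
  // arrays unwrapped; both features are disabled by default in Jackson.
  val mapper: ObjectMapper = new ObjectMapper()
    .configure(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY, true)
    .configure(SerializationFeature.WRITE_SINGLE_ELEM_ARRAYS_UNWRAPPED, true)
}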

@SparkQA

SparkQA commented Jul 26, 2017

Test build #79968 has finished for PR 18645 at commit 82b51dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuwei-k

xuwei-k commented Jul 29, 2017

@srowen
Member Author

srowen commented Jul 29, 2017

Oh nice, that will help if it's not a required update along with everything else here, since it requires a Jackson update. The Jackson update may be important eventually, but it would be nice if it can be separate.

@SparkQA

SparkQA commented Jul 30, 2017

Test build #80057 has finished for PR 18645 at commit bc419a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member Author

srowen commented Jul 31, 2017

The json4s change above made this change notably simpler. The current problem is, as before, the next test error:

sbt.ForkMain$ForkError: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.sql.UDT$MyDenseVector
	at org.apache.spark.sql.UserDefinedTypeSuite$$anonfun$10.apply(UserDefinedTypeSuite.scala:211)
	at org.apache.spark.sql.UserDefinedTypeSuite$$anonfun$10.apply(UserDefinedTypeSuite.scala:205)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
	at org.apache.spark.sql.UserDefinedTypeSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(UserDefinedTypeSuite.scala:146)
	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
	at org.apache.spark.sql.UserDefinedTypeSuite.runTest(UserDefinedTypeSuite.scala:146)

See comments on that test:

3ae25f2#commitcomment-23320234
3ae25f2#commitcomment-23341672

I'm currently looking into Michael's suggestion in the second comment.

@SparkQA

SparkQA commented Jul 31, 2017

Test build #80074 has finished for PR 18645 at commit dc2cdb4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81037 has finished for PR 18645 at commit 0c81fdd.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2017

Test build #81143 has finished for PR 18645 at commit 273dbdb.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2017

Test build #3905 has finished for PR 18645 at commit 273dbdb.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2017

Test build #81184 has finished for PR 18645 at commit 19ee35e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2017

Test build #81255 has finished for PR 18645 at commit f6b0a4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member Author

srowen commented Aug 30, 2017

The status here is that this change does work fine with Scala 2.11, and does implement changes that will be needed for 2.12. Right now I've found that, suddenly, it doesn't compile for Scala 2.12:

[error] /Users/srowen/Documents/Cloudera/spark/core/src/test/scala/org/apache/spark/FileSuite.scala:100: could not find implicit value for parameter kcf: () => org.apache.spark.WritableConverter[org.apache.hadoop.io.IntWritable]
[error] Error occurred in an application involving default arguments.
[error]     val output = sc.sequenceFile[IntWritable, Text](outputDir)
[error]                                                    ^
[error] /Users/srowen/Documents/Cloudera/spark/core/src/test/scala/org/apache/spark/FileSuite.scala:131: could not find implicit value for parameter kcf: () => org.apache.spark.WritableConverter[org.apache.hadoop.io.IntWritable]
[error] Error occurred in an application involving default arguments.
[error]     val output = sc.sequenceFile[IntWritable, Text](outputDir)

I didn't see this with an earlier 2.12 release, I believe, so it could be some change or problem in implicit resolution.

Of course, the 2.12 build was never going to work fully after this change anyway (Kafka 0.8 needs to go, and the REPL needs updating), so it may be worth going ahead regardless.
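
My guess (just an assumption at this point) is that this is the pattern where a nullary implicit def used to satisfy an implicit parameter of function type via eta-expansion; a stripped-down sketch with hypothetical names:

class Converter[T]

object Converters {
  // The 2.11-era shape: an implicit method whose eta-expansion used to satisfy
  // an implicit parameter of type () => Converter[Int].
  implicit def intConverter(): Converter[Int] = new Converter[Int]

  // A shape that should keep resolving on 2.12: an implicit value that already
  // has the function type the call site asks for.
  // implicit val intConverterFn: () => Converter[Int] = () => new Converter[Int]
}

object Usage {
  import Converters._

  // Mirrors the implicit parameter named in the compile error above.
  def sequence[K](path: String)(implicit kcf: () => Converter[K]): String = path

  // Resolves on 2.11 via eta-expansion of intConverter; the error above suggests
  // 2.12 no longer finds it this way.
  val out: String = sequence[Int]("dir")
}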

@srowen
Member Author

srowen commented Sep 1, 2017

Merged to master

asfgit closed this in 12ab7f7 on Sep 1, 2017
srowen deleted the SPARK-14280 branch on September 6, 2017 at 12:47