[SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency #7403
ShuffleManager implementations are currently not given type information for the key, value and combiner classes. Serialization of shuffle objects relies either on objects being Java-serializable, with methods defined for reading/writing the object, or on serialization via Kryo, which uses reflection. Serialization systems like Avro, Thrift and Protobuf generate classes with zero-argument constructors and explicit schema information (e.g. IndexedRecords in Avro have get, put and getSchema methods). By serializing the key, value and combiner class names in ShuffleDependency, shuffle implementations gain access to schema information at the time registerShuffle() is called.
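For illustration, here is a minimal sketch (not part of this patch; the helper object, its name, and its use of Class.forName are assumptions) of what a shuffle manager could do with these class names once registerShuffle() receives the dependency:

```scala
import org.apache.spark.ShuffleDependency

// Hypothetical helper, not part of this PR: resolve the classes a shuffle will
// actually write, using the names now carried by ShuffleDependency.
object ShuffleSchemaUtil {
  def pairClasses(dep: ShuffleDependency[_, _, _]): (Class[_], Class[_]) = {
    val keyClass = Class.forName(dep.keyClassName)
    // With map-side combine the shuffle files hold (key, combiner) pairs,
    // otherwise they hold (key, value) pairs.
    val payloadName =
      if (dep.mapSideCombine) dep.combinerClassName.getOrElse(dep.valueClassName)
      else dep.valueClassName
    (keyClass, Class.forName(payloadName))
  }
}
```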
@@ -17,6 +17,8 @@
 package org.apache.spark
 
+import scala.reflect.ClassTag
+
 import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.rdd.RDD
 import org.apache.spark.serializer.Serializer
@@ -65,7 +67,7 @@ abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
  */
 @DeveloperApi
-class ShuffleDependency[K, V, C](
+class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
     @transient private val _rdd: RDD[_ <: Product2[K, V]],
     val partitioner: Partitioner,
     val serializer: Option[Serializer] = None,
@@ -76,6 +78,15 @@ class ShuffleDependency[K, V, C](
 
   override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
 
+  /**
+   * The key, value and combiner classes are serialized so that shuffle manager
+   * implementation can use the information to build
+   */
+  val keyClassName: String = reflect.classTag[K].runtimeClass.getName
Contributor: how is this used? It might require the key class to have a 0-arg ctor, right?

Author: Here's an example of how I use this in the Parquet shuffle manager to create a schema for the (key, value) or (key, combiner) pairs for the shuffle files.
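As a rough illustration of that idea (a sketch, not the linked Parquet shuffle manager code; it assumes the classes named in the dependency are Avro-generated SpecificRecords with the usual 0-arg constructor):

```scala
import org.apache.avro.Schema
import org.apache.avro.specific.SpecificRecord

// Hypothetical: recover an Avro schema from a class name stored in the dependency.
def avroSchemaFor(className: String): Schema = {
  val cls = Class.forName(className)
  require(classOf[SpecificRecord].isAssignableFrom(cls),
    s"$className is not an Avro-generated record")
  // Avro-generated classes have a 0-arg constructor, so instantiating one and
  // asking it for its schema works here.
  cls.newInstance().asInstanceOf[SpecificRecord].getSchema
}
```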
+  val valueClassName: String = reflect.classTag[V].runtimeClass.getName
+  // Note: It's possible that the combiner class tag is null, if the combineByKey
+  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
+  val combinerClassName: Option[String] = Option(reflect.classTag[C]).map(_.runtimeClass.getName)
Contributor: these should be

Author: If we make this Another approach would be to change the ..instead of passing that information in the dependency.

Author: @andrewor14 Let me know which approach you prefer -- (a) keeping the class names public or (b) changing the The first approach has the advantage that the data types are available everywhere the

Contributor: let's not change the
   val shuffleId: Int = _rdd.context.newShuffleId()
 
   val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
@@ -57,6 +57,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     with SparkHadoopMapReduceUtil
     with Serializable
 {
 
   /**
    * Generic function to combine the elements for each key using a custom set of aggregation
    * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
@@ -70,12 +71,13 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * In addition, users can control the partitioning of the output RDD, and whether to perform
    * map-side aggregation (if a mapper can produce multiple items with the same key).
    */
-  def combineByKey[C](createCombiner: V => C,
+  def combineByKeyWithClassTag[C](
Contributor: actually, why call this something else? Does it not compile if you just called this

Author: Because If you know a way of doing this more cleanly, I would be happy to make that change.
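A minimal sketch of the constraint being discussed (an assumption about the elided snippet, since the exact code in the comment is missing): Scala rejects two overloads of the same method that both declare default arguments, so the ClassTag-taking variant cannot simply be another combineByKey overload without dropping the defaults or changing the existing signature.

```scala
import scala.reflect.ClassTag

class Pairs[K, V] {
  // Existing public method: default arguments, no ClassTag.
  def combineByKey[C](createCombiner: V => C,
                      mapSideCombine: Boolean = true): Unit = ()

  // If this were also named combineByKey, scalac would reject it with
  // "multiple overloaded alternatives of method combineByKey define default arguments",
  // hence the separate combineByKeyWithClassTag name.
  def combineByKeyWithClassTag[C](createCombiner: V => C,
                                  mapSideCombine: Boolean = true)
                                 (implicit ct: ClassTag[C]): Unit = ()
}
```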
+      createCombiner: V => C,
       mergeValue: (C, V) => C,
       mergeCombiners: (C, C) => C,
       partitioner: Partitioner,
       mapSideCombine: Boolean = true,
-      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
+      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
     require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
     if (keyClass.isArray) {
       if (mapSideCombine) {
@@ -103,13 +105,48 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   }
 
   /**
-   * Simplified version of combineByKey that hash-partitions the output RDD.
+   * This method is here for backward compatibility. It
+   * does not provide combiner classtag information to
+   * the shuffle.
Contributor: The Javadoc should start with a description of what the method does. We should use the old one and add that it exists for backward compatibility after the first sentence.
+   *
+   * @see [[combineByKeyWithClassTag]]
+   */
+  def combineByKey[C](
+      createCombiner: V => C,
+      mergeValue: (C, V) => C,
+      mergeCombiners: (C, C) => C,
+      partitioner: Partitioner,
+      mapSideCombine: Boolean = true,
+      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
+    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
+      partitioner, mapSideCombine, serializer)(null)
+  }
 
   /**
+   * This method is here for backward compatibility. It
+   * does not provide combiner classtag information to
+   * the shuffle.
+   *
+   * @see [[combineByKeyWithClassTag]]
    */
-  def combineByKey[C](createCombiner: V => C,
+  def combineByKey[C](
+      createCombiner: V => C,
       mergeValue: (C, V) => C,
       mergeCombiners: (C, C) => C,
       numPartitions: Int): RDD[(K, C)] = self.withScope {
-    combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numPartitions))
+    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
   }
 
+  /**
+   * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
+   */
+  def combineByKeyWithClassTag[C](
+      createCombiner: V => C,
+      mergeValue: (C, V) => C,
+      mergeCombiners: (C, C) => C,
+      numPartitions: Int)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
+    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
+      new HashPartitioner(numPartitions))
+  }
 
   /**
@@ -133,7 +170,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
     // We will clean the combiner closure later in `combineByKey`
     val cleanedSeqOp = self.context.clean(seqOp)
-    combineByKey[U]((v: V) => cleanedSeqOp(createZero(), v), cleanedSeqOp, combOp, partitioner)
+    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
+      cleanedSeqOp, combOp, partitioner)
   }
 
   /**
@@ -182,7 +220,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))
 
     val cleanedFunc = self.context.clean(func)
-    combineByKey[V]((v: V) => cleanedFunc(createZero(), v), cleanedFunc, cleanedFunc, partitioner)
+    combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
+      cleanedFunc, cleanedFunc, partitioner)
   }
 
   /**
@@ -268,7 +307,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * "combiner" in MapReduce.
    */
   def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
-    combineByKey[V]((v: V) => v, func, func, partitioner)
+    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
   }
 
   /**
@@ -392,7 +431,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
       h1
     }
 
-    combineByKey(createHLL, mergeValueHLL, mergeHLL, partitioner).mapValues(_.cardinality())
+    combineByKeyWithClassTag(createHLL, mergeValueHLL, mergeHLL, partitioner)
+      .mapValues(_.cardinality())
   }
 
   /**
@@ -466,7 +506,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     val createCombiner = (v: V) => CompactBuffer(v)
     val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
     val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
-    val bufs = combineByKey[CompactBuffer[V]](
+    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
     bufs.asInstanceOf[RDD[(K, Iterable[V])]]
   }
@@ -565,12 +605,28 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   }
 
   /**
-   * Simplified version of combineByKey that hash-partitions the resulting RDD using the
+   * This method is here for backward compatibility. It
+   * does not provide combiner classtag information to
+   * the shuffle.
+   *
+   * @see [[combineByKeyWithClassTag]]
+   */
+  def combineByKey[C](
+      createCombiner: V => C,
+      mergeValue: (C, V) => C,
+      mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
+    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
+  }
+
+  /**
+   * Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the
    * existing partitioner/parallelism level.
    */
-  def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
-    : RDD[(K, C)] = self.withScope {
-    combineByKey(createCombiner, mergeValue, mergeCombiners, defaultPartitioner(self))
+  def combineByKeyWithClassTag[C](
+      createCombiner: V => C,
+      mergeValue: (C, V) => C,
+      mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
+    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, defaultPartitioner(self))
   }
 
   /**
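A usage sketch (assuming a local SparkContext; the names and sample data are illustrative, not taken from the patch) of how the new variant feeds the combiner class to the shuffle, while the old overloads keep working and simply record no combiner class:

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("classtag-demo"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// New API: the ClassTag for Vector[Int] is captured implicitly and reaches the dependency.
val combined = pairs.combineByKeyWithClassTag[Vector[Int]](
  (v: Int) => Vector(v),
  (c: Vector[Int], v: Int) => c :+ v,
  (c1: Vector[Int], c2: Vector[Int]) => c1 ++ c2)
val dep = combined.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]]
assert(dep.combinerClassName == Some(classOf[Vector[_]].getName))

// Old API (kept for compatibility): compiles unchanged, but the combiner class is unknown.
val legacy = pairs.combineByKey(
  (v: Int) => Vector(v),
  (c: Vector[Int], v: Int) => c :+ v,
  (c1: Vector[Int], c2: Vector[Int]) => c1 ++ c2)
val legacyDep = legacy.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]]
assert(legacyDep.combinerClassName.isEmpty)

sc.stop()
```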
| @@ -0,0 +1,67 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package org.apache.spark.shuffle | ||
|
|
||
| import org.apache.spark._ | ||
|
|
||
| case class KeyClass() | ||
|
|
||
| case class ValueClass() | ||
|
|
||
| case class CombinerClass() | ||
|
|
||
| class ShuffleDependencySuite extends SparkFunSuite with LocalSparkContext { | ||
|
|
||
| val conf = new SparkConf(loadDefaults = false) | ||
|
|
||
| test("key, value, and combiner classes correct in shuffle dependency without aggregation") { | ||
| sc = new SparkContext("local", "test", conf.clone()) | ||
| val rdd = sc.parallelize(1 to 5, 4) | ||
| .map(key => (KeyClass(), ValueClass())) | ||
| .groupByKey() | ||
| val dep = rdd.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]] | ||
| assert(!dep.mapSideCombine, "Test requires that no map-side aggregator is defined") | ||
| assert(dep.keyClassName == classOf[KeyClass].getName) | ||
| assert(dep.valueClassName == classOf[ValueClass].getName) | ||
| } | ||
|
|
||
| test("key, value, and combiner classes available in shuffle dependency with aggregation") { | ||
| sc = new SparkContext("local", "test", conf.clone()) | ||
| val rdd = sc.parallelize(1 to 5, 4) | ||
| .map(key => (KeyClass(), ValueClass())) | ||
| .aggregateByKey(CombinerClass())({ case (a, b) => a }, { case (a, b) => a }) | ||
| val dep = rdd.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]] | ||
| assert(dep.mapSideCombine && dep.aggregator.isDefined, "Test requires map-side aggregation") | ||
| assert(dep.keyClassName == classOf[KeyClass].getName) | ||
| assert(dep.valueClassName == classOf[ValueClass].getName) | ||
| assert(dep.combinerClassName == Some(classOf[CombinerClass].getName)) | ||
| } | ||
|
|
||
| test("combineByKey null combiner class tag handled correctly") { | ||
| sc = new SparkContext("local", "test", conf.clone()) | ||
| val rdd = sc.parallelize(1 to 5, 4) | ||
| .map(key => (KeyClass(), ValueClass())) | ||
| .combineByKey((v: ValueClass) => v, | ||
| (c: AnyRef, v: ValueClass) => c, | ||
| (c1: AnyRef, c2: AnyRef) => c1) | ||
| val dep = rdd.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]] | ||
| assert(dep.keyClassName == classOf[KeyClass].getName) | ||
| assert(dep.valueClassName == classOf[ValueClass].getName) | ||
| assert(dep.combinerClassName == None) | ||
| } | ||
|
|
||
| } |
FYI, this is a binary-incompatible change to a DeveloperApi.

According to the source for DeveloperApi, a Developer API is unstable and can change between minor releases.