[SQL] [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection #5279
CatalystTypeConverters.scala

```diff
@@ -33,7 +33,8 @@ object CatalystTypeConverters {
   import scala.collection.Map

   /**
-   * Converts Scala objects to catalyst rows / types.
+   * Converts Scala objects to catalyst rows / types. This method is slow, and for batch
+   * conversion you should be using converter produced by createToCatalystConverter.
    * Note: This is always called after schemaFor has been called.
    * This ordering is important for UDT registration.
    */
```
```diff
@@ -97,6 +98,8 @@ object CatalystTypeConverters {

   /**
    * Creates a converter function that will convert Scala objects to the specified catalyst type.
+   * Typical use case would be converting a collection of rows that have the same schema. You will
+   * call this function once to get a converter, and apply it to every row.
    */
   private[sql] def createToCatalystConverter(dataType: DataType): Any => Any = {
     def extractOption(item: Any): Any = item match {
```
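For illustration only (not part of the PR): the call pattern the new doc comment describes is to build one converter per schema and reuse it for every row, instead of calling the slow per-value `convertToCatalyst`. A minimal sketch, assuming the Spark internals as of this PR; `createToCatalystConverter` is `private[sql]`, so the example pretends to live under the `org.apache.spark.sql` package, and the object name `ToCatalystExample` is hypothetical.

```scala
// Hypothetical example, not part of this PR. Placed under org.apache.spark.sql
// because createToCatalystConverter is private[sql].
package org.apache.spark.sql.example

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ToCatalystExample {
  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    // Build the converter once per schema...
    val toCatalyst: Any => Any = CatalystTypeConverters.createToCatalystConverter(schema)

    // ...then apply it to every row, instead of calling
    // CatalystTypeConverters.convertToCatalyst(row, schema) per element.
    val converted = Seq(Row("alice", 30), Row("bob", 25)).map(toCatalyst)
    converted.foreach(println)
  }
}
```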
```diff
@@ -181,7 +184,10 @@ object CatalystTypeConverters {
     }
   }

-  /** Converts Catalyst types used internally in rows to standard Scala types */
+  /** Converts Catalyst types used internally in rows to standard Scala types
+   * This method is slow, and for batch conversion you should be using converter
+   * produced by createToScalaConverter.
+   */
   def convertToScala(a: Any, dataType: DataType): Any = (a, dataType) match {
     // Check UDT first since UDTs can override other types
     case (d, udt: UserDefinedType[_]) =>
```
```diff
@@ -210,6 +216,8 @@ object CatalystTypeConverters {

   /**
    * Creates a converter function that will convert Catalyst types to Scala type.
+   * Typical use case would be converting a collection of rows that have the same schema. You will
+   * call this function once to get a converter, and apply it to every row.
    */
   private[sql] def createToScalaConverter(dataType: DataType): Any => Any = dataType match {
     // Check UDT first since UDTs can override other types
```

Contributor: Also add a note in this guy and its counterpart about the pattern that they're expected to be used in. I was just reading executeCollect() and realized it may not be obvious to someone coming from there. Just something like "use this during batch conversion, such as within a mapPartitions, to generate a function which efficiently converts Catalyst types back to Scala types for a particular schema."

Author: Added.
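To make the reviewer's suggested pattern concrete, here is a hypothetical sketch (not from the PR) of batch conversion inside `mapPartitions`, so the converter is constructed once per partition rather than once per value. The helper name `toScalaRows` is invented, and the code again assumes it compiles under the `org.apache.spark.sql` package because `createToScalaConverter` is `private[sql]`.

```scala
// Hypothetical helper, not part of this PR.
package org.apache.spark.sql.example

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types.StructType

object ToScalaExample {
  // Convert Catalyst rows back to Scala rows for a particular schema,
  // building the converter once per partition instead of pattern-matching
  // on the data type for every single value.
  def toScalaRows(rdd: RDD[Row], schema: StructType): RDD[Row] =
    rdd.mapPartitions { iter =>
      val toScala = CatalystTypeConverters.createToScalaConverter(schema)
      iter.map(row => toScala(row).asInstanceOf[Row])
    }
}
```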
basicOperators.scala

```diff
@@ -140,10 +140,8 @@ case class TakeOrdered(limit: Int, sortOrder: Seq[SortOrder], child: SparkPlan)
   private def collectData(): Array[Row] = child.execute().map(_.copy()).takeOrdered(limit)(ord)

   override def executeCollect(): Array[Row] = {
-    val converters = schema.fields.map {
-      f => CatalystTypeConverters.createToScalaConverter(f.dataType)
-    }
-    collectData().map(CatalystTypeConverters.convertRowWithConverters(_, schema, converters))
+    val converter = CatalystTypeConverters.createToScalaConverter(schema)
+    collectData().map(converter(_).asInstanceOf[Row])
   }

   // TODO: Terminal split should be implemented differently from non-terminal split.
```

Contributor: Is there any reason that …

Contributor: I guess the answer here is "udfs".

Author: …
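A note on the design choice in the executeCollect() change above: createToScalaConverter(schema) returns a single closure that encodes the recursive structure of the entire schema, so the collect path no longer needs a per-field converter array or the convertRowWithConverters helper; the asInstanceOf[Row] cast relies on the fact that converting a value of struct type back to Scala yields a Row.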
Contributor: Wrong comment style.