
Conversation

@eatoncys
Contributor

@eatoncys eatoncys commented Dec 8, 2018

What changes were proposed in this pull request?

RDDConversions would get disproportionately slower as the number of columns in the query increased, because the converters were previously stored as a scala.collection.immutable.:: (a cons list), whose indexed access is O(n).
This PR removes RDDConversions and uses RowEncoder to convert each Row to an InternalRow.

The test of PrunedScanSuite for 2000 columns and 20k rows takes 409 seconds before this PR, and 361 seconds after.
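The slowdown can be reproduced outside Spark. Below is a minimal pure-Scala sketch (all names are illustrative, not from the patch): `List.apply(i)` walks `i` cons cells, so indexing every column of every row costs O(columns²) per row, while the array produced by `.toArray` gives O(1) access per column.

```scala
// Illustrative stand-ins for Catalyst's per-column converters; not Spark code.
object ListVsArray {
  def main(args: Array[String]): Unit = {
    val numColumns = 4
    // Before: converters built from a Seq end up as a cons list (::).
    val convertersList: List[Any => Any] = List.fill(numColumns)((v: Any) => v)
    // After: one .toArray call makes every converters(i) lookup O(1).
    val convertersArray: Array[Any => Any] = convertersList.toArray

    val row = Array[Any]("a", 1, 2.0, true)
    // convertersList(i) traverses i cells; convertersArray(i) is a direct load.
    val outList  = Array.tabulate[Any](numColumns)(i => convertersList(i)(row(i)))
    val outArray = Array.tabulate[Any](numColumns)(i => convertersArray(i)(row(i)))
    assert(outList.sameElements(outArray))
    println(outArray.mkString(",")) // a,1,2.0,true
  }
}
```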

How was this patch tested?

Test case of PrunedScanSuite

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99861 has finished for PR 23262 at commit ddb2528.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

eatoncys commented Dec 8, 2018

retest this please

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99866 has finished for PR 23262 at commit ddb2528.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Dec 8, 2018

Good catch, LGTM
cc @cloud-fan

  val numColumns = outputTypes.length
  val mutableRow = new GenericInternalRow(numColumns)
- val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
+ val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter).toArray
Contributor

shall we use ExpressionEncoder here?

Contributor Author

It is a good suggestion and has been applied; would you like to review it again? Thanks.

  val numColumns = outputTypes.length
  val mutableRow = new GenericInternalRow(numColumns)
- val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
+ val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter).toArray
Contributor

shall we use RowEncoder here?

Contributor Author

It has been modified, and the performance is the same as converting to arrays.

  object RDDConversions {
- def productToRowRdd[A <: Product](data: RDD[A], outputTypes: Seq[DataType]): RDD[InternalRow] = {
+ def productToRowRdd[A <: Product : TypeTag](data: RDD[A],
+     outputSchema: StructType): RDD[InternalRow] = {
Contributor

nit: indent

Contributor

well, seems like this is never used actually... shall we remove it instead if this is the case?

   * Convert the objects inside Row into the types Catalyst expected.
   */
- def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): RDD[InternalRow] = {
+ def rowToRowRdd(data: RDD[Row], outputSchema: StructType): RDD[InternalRow] = {
Member

Let's remove the whole object. rowToRowRdd looks to be used in only one place, and the code here is quite small.

   */
- def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): RDD[InternalRow] = {
+ def rowToRowRdd(data: RDD[Row], outputSchema: StructType): RDD[InternalRow] = {
+   val converters = RowEncoder(outputSchema)
Member

I checked each case. Everything looks fine except one:

case chr: Char => UTF8String.fromString(chr.toString)

Looks like we're going to drop Char as StringType. I think it's trivial, and rather a mistake that we supported this. I don't feel strongly about documenting it in the migration guide, but if anyone feels so, we'd better do that.
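For context, the special case being dropped can be sketched in plain Scala (with `UTF8String.fromString(chr.toString)` approximated by `chr.toString`, since this sketch avoids Spark dependencies): the old converter stringified a `Char` supplied for a `StringType` column, which `RowEncoder` no longer does.

```scala
// Sketch of the old behavior being dropped; names and types are illustrative.
object CharCaseSketch {
  // Approximates the old per-value conversion path for StringType columns.
  def oldConvert(v: Any): String = v match {
    case chr: Char => chr.toString // the special case going away: Char -> String
    case s: String => s            // the normal case, which survives
  }
  def main(args: Array[String]): Unit = {
    assert(oldConvert('a') == "a")
    assert(oldConvert("ab") == "ab")
  }
}
```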

@HyukjinKwon
Member

LGTM otherwise

@eatoncys
Contributor Author

@HyukjinKwon @mgaido91 Thanks for the review. @cloud-fan @kiszk Would you like to give some suggestions: should we remove the object RDDConversions, or leave it there?

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99899 has finished for PR 23262 at commit 89f3191.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Let's remove it. There's no point in keeping an unused method; the code will remain in the commit history anyway. Also, there's little value in keeping a few-line method that's called in only one place.

@eatoncys
Contributor Author

@HyukjinKwon Ok, removed it, thanks for review.

- execution.RDDConversions.rowToRowRdd(rdd, output.map(_.dataType))
+ val converters = RowEncoder(StructType.fromAttributes(output))
+ rdd.mapPartitions { iterator =>
+   iterator.map { r =>
Contributor

nit: iterator.map(converters.toRow)

Contributor Author

Modified, thanks.
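To make the final shape concrete, here is a hedged pure-Scala sketch of the new pattern with Spark types swapped for stand-ins (`Row`, `InternalRow`, and `toInternal` are illustrative names; `toInternal` plays the role of `RowEncoder(schema).toRow`):

```scala
// Pure-Scala sketch of the new conversion path; Spark types are stand-ins.
object RowConversionSketch {
  type Row = Seq[Any]           // stand-in for org.apache.spark.sql.Row
  type InternalRow = Array[Any] // stand-in for Catalyst's InternalRow

  // Plays the role of RowEncoder(schema).toRow: one reusable conversion function.
  val toInternal: Row => InternalRow = _.toArray

  // Mirrors rdd.mapPartitions { iterator => iterator.map(toInternal) }.
  def rowToRowRdd(partitions: Seq[Iterator[Row]]): Seq[Iterator[InternalRow]] =
    partitions.map(_.map(toInternal))

  def main(args: Array[String]): Unit = {
    val parts = Seq(Iterator[Row](Seq("a", 1), Seq("b", 2)))
    val converted = rowToRowRdd(parts).head.toList
    assert(converted.map(_.toList) == List(List("a", 1), List("b", 2)))
  }
}
```

One function value is built once per partition-mapping closure and reused for every row, which is what makes `iterator.map(converters.toRow)` both the concise and the efficient form.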

@cloud-fan
Contributor

LGTM, can you update the PR title and description?

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99903 has finished for PR 23262 at commit 907d3f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys eatoncys changed the title [SPARK-26312][SQL]Converting converters in RDDConversions into arrays to improve their access performance [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance Dec 10, 2018
@eatoncys
Contributor Author

@cloud-fan Updated, thanks.

@eatoncys
Contributor Author

retest this please

@mgaido91
Contributor

LGTM

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99911 has finished for PR 23262 at commit e8817e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99916 has finished for PR 23262 at commit 9758534.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99910 has finished for PR 23262 at commit 56cf4e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99913 has finished for PR 23262 at commit 9758534.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@HyukjinKwon
Member

It doesn't work right now... right? haha. I tried to merge but failed. Somehow the Apache Spark repo looks to be down (?).

@cloud-fan
Contributor

Yea, it didn't work; the PR is not merged. I'll try again later. cc @srowen do you hit the same issue? I already switched to gitbox.

@HyukjinKwon
Member

Oops, it looks to be working fine for me after switching to gitbox.

@HyukjinKwon
Member

Weird. My remote is as below FWIW.

$ git remote -v
apache	https://gitbox.apache.org/repos/asf/spark.git (fetch)
apache	https://gitbox.apache.org/repos/asf/spark.git (push)
...

@asfgit asfgit closed this in cbe9230 Dec 11, 2018
@srowen
Member

srowen commented Dec 11, 2018

It's been working for me, with the merge script and with GitHub and gitbox

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… to improve its conversion performance

## What changes were proposed in this pull request?

`RDDConversions` would get disproportionately slower as the number of columns in the query increased, because the `converters` were previously stored as a `scala.collection.immutable.::` (a cons list), whose indexed access is O(n).
This PR removes `RDDConversions` and uses `RowEncoder` to convert each `Row` to an `InternalRow`.

The test of `PrunedScanSuite` for 2000 columns and 20k rows takes 409 seconds before this PR, and 361 seconds after.

## How was this patch tested?

Test case of `PrunedScanSuite`

Closes apache#23262 from eatoncys/toarray.

Authored-by: 10129659 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>