
Conversation

@eatoncys
Contributor

@eatoncys eatoncys commented Dec 8, 2018

What changes were proposed in this pull request?

RDDConversions would get disproportionately slower as the number of columns in the query increased, because the converters were previously stored as a scala.collection.immutable.:: (a cons list), whose indexed access is O(n).
This PR removes RDDConversions and uses RowEncoder to convert each Row to an InternalRow.

The test of PrunedScanSuite for 2000 columns and 20k rows takes 409 seconds before this PR, and 361 seconds after.
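The slowdown can be reproduced outside Spark. Below is a minimal pure-Scala sketch (all names are illustrative, not from the patch): `List.apply(i)` walks `i` cons cells, so indexing every column of every row costs O(columns²) per row, while the array produced by `.toArray` gives O(1) access per column.

```scala
// Illustrative stand-ins for Catalyst's per-column converters; not Spark code.
object ListVsArray {
  def main(args: Array[String]): Unit = {
    val numColumns = 4
    // Before: converters built from a Seq end up as a cons list (::).
    val convertersList: List[Any => Any] = List.fill(numColumns)((v: Any) => v)
    // After: one .toArray call makes every converters(i) lookup O(1).
    val convertersArray: Array[Any => Any] = convertersList.toArray

    val row = Array[Any]("a", 1, 2.0, true)
    // convertersList(i) traverses i cells; convertersArray(i) is a direct load.
    val outList  = Array.tabulate[Any](numColumns)(i => convertersList(i)(row(i)))
    val outArray = Array.tabulate[Any](numColumns)(i => convertersArray(i)(row(i)))
    assert(outList.sameElements(outArray))
    println(outArray.mkString(",")) // a,1,2.0,true
  }
}
```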

How was this patch tested?

Test case of PrunedScanSuite

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99861 has finished for PR 23262 at commit ddb2528.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

eatoncys commented Dec 8, 2018

retest this please

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99866 has finished for PR 23262 at commit ddb2528.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Dec 8, 2018

Good catch, LGTM
cc @cloud-fan

  val numColumns = outputTypes.length
  val mutableRow = new GenericInternalRow(numColumns)
- val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
+ val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter).toArray
Contributor

shall we use ExpressionEncoder here?

Contributor Author

It is a good suggestion and has been applied; would you like to review it again? Thanks.

  val numColumns = outputTypes.length
  val mutableRow = new GenericInternalRow(numColumns)
- val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter)
+ val converters = outputTypes.map(CatalystTypeConverters.createToCatalystConverter).toArray
Contributor

shall we use RowEncoder here?

Contributor Author

It has been modified, and the performance is the same as converting to arrays.

  object RDDConversions {
- def productToRowRdd[A <: Product](data: RDD[A], outputTypes: Seq[DataType]): RDD[InternalRow] = {
+ def productToRowRdd[A <: Product : TypeTag](data: RDD[A],
+     outputSchema: StructType): RDD[InternalRow] = {
Contributor

nit: indent

Contributor

well, seems like this is never used actually... shall we remove it instead if this is the case?

   * Convert the objects inside Row into the types Catalyst expected.
   */
- def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): RDD[InternalRow] = {
+ def rowToRowRdd(data: RDD[Row], outputSchema: StructType): RDD[InternalRow] = {
Member

Let's remove the whole object. rowToRowRdd looks to be used in only one place, and the code here is quite small.

   */
- def rowToRowRdd(data: RDD[Row], outputTypes: Seq[DataType]): RDD[InternalRow] = {
+ def rowToRowRdd(data: RDD[Row], outputSchema: StructType): RDD[InternalRow] = {
+   val converters = RowEncoder(outputSchema)
Member

I checked each case. Everything looks fine except one:

case chr: Char => UTF8String.fromString(chr.toString)

Looks like we're going to drop Char as StringType. I think it's trivial, and rather a mistake that we supported this. I don't feel strongly about documenting it in the migration guide, but if anyone feels so, we'd better do that.
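For context, the special case being dropped can be sketched in plain Scala (with `UTF8String.fromString(chr.toString)` approximated by `chr.toString`, since this sketch avoids Spark dependencies): the old converter stringified a `Char` supplied for a `StringType` column, which `RowEncoder` no longer does.

```scala
// Sketch of the old behavior being dropped; names and types are illustrative.
object CharCaseSketch {
  // Approximates the old per-value conversion path for StringType columns.
  def oldConvert(v: Any): String = v match {
    case chr: Char => chr.toString // the special case going away: Char -> String
    case s: String => s            // the normal case, which survives
  }
  def main(args: Array[String]): Unit = {
    assert(oldConvert('a') == "a")
    assert(oldConvert("ab") == "ab")
  }
}
```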

@HyukjinKwon
Member

LGTM otherwise

@eatoncys
Contributor Author

@HyukjinKwon @mgaido91 Thanks for the review. @cloud-fan @kiszk Would you like to give some suggestions: should we remove the object RDDConversions, or leave it there?

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99899 has finished for PR 23262 at commit 89f3191.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Let's remove it. There's no point in keeping an unused method; the code will remain in the commit history anyway. Also, there's little value in keeping a few-line method that's called in only one place.

@eatoncys
Contributor Author

@HyukjinKwon Ok, removed it, thanks for review.

- execution.RDDConversions.rowToRowRdd(rdd, output.map(_.dataType))
+ val converters = RowEncoder(StructType.fromAttributes(output))
+ rdd.mapPartitions { iterator =>
+   iterator.map { r =>
Contributor

nit: iterator.map(converters.toRow)

Contributor Author

Modified, thanks.
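To make the final shape concrete, here is a hedged pure-Scala sketch of the new pattern with Spark types swapped for stand-ins (`Row`, `InternalRow`, and `toInternal` are illustrative names; `toInternal` plays the role of `RowEncoder(schema).toRow`):

```scala
// Pure-Scala sketch of the new conversion path; Spark types are stand-ins.
object RowConversionSketch {
  type Row = Seq[Any]           // stand-in for org.apache.spark.sql.Row
  type InternalRow = Array[Any] // stand-in for Catalyst's InternalRow

  // Plays the role of RowEncoder(schema).toRow: one reusable conversion function.
  val toInternal: Row => InternalRow = _.toArray

  // Mirrors rdd.mapPartitions { iterator => iterator.map(toInternal) }.
  def rowToRowRdd(partitions: Seq[Iterator[Row]]): Seq[Iterator[InternalRow]] =
    partitions.map(_.map(toInternal))

  def main(args: Array[String]): Unit = {
    val parts = Seq(Iterator[Row](Seq("a", 1), Seq("b", 2)))
    val converted = rowToRowRdd(parts).head.toList
    assert(converted.map(_.toList) == List(List("a", 1), List("b", 2)))
  }
}
```

One function value is built once per partition-mapping closure and reused for every row, which is what makes `iterator.map(converters.toRow)` both the concise and the efficient form.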

@cloud-fan
Contributor

LGTM, can you update the PR title and description?

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99903 has finished for PR 23262 at commit 907d3f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys eatoncys changed the title [SPARK-26312][SQL]Converting converters in RDDConversions into arrays to improve their access performance [SPARK-26312][SQL]Replace RDDConversions.rowToRowRdd with RowEncoder to improve its conversion performance Dec 10, 2018
@eatoncys
Contributor Author

@cloud-fan Updated, thanks.

@eatoncys
Contributor Author

retest this please

@mgaido91
Contributor

LGTM

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99911 has finished for PR 23262 at commit e8817e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99916 has finished for PR 23262 at commit 9758534.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99910 has finished for PR 23262 at commit 56cf4e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 10, 2018

Test build #99913 has finished for PR 23262 at commit 9758534.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@HyukjinKwon
Member

It doesn't work right now... right? haha. I tried to merge but failed. Somehow the Apache Spark repo looks to be down (?).

@cloud-fan
Contributor

Yea, it didn't work; the PR is not merged. I'll try again later. cc @srowen do you hit the same issue? I already switched to gitbox.

@HyukjinKwon
Member

Oops, it looks to be working fine for me after switching to gitbox.

@HyukjinKwon
Member

Weird. My remote is as below FWIW.

$ git remote -v
apache	https://gitbox.apache.org/repos/asf/spark.git (fetch)
apache	https://gitbox.apache.org/repos/asf/spark.git (push)
...

@asfgit asfgit closed this in cbe9230 Dec 11, 2018
@srowen
Member

srowen commented Dec 11, 2018

It's been working for me, with the merge script and with GitHub and gitbox

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… to improve its conversion performance

## What changes were proposed in this pull request?

`RDDConversions` would get disproportionately slower as the number of columns in the query increased, because the `converters` were previously stored as a `scala.collection.immutable.::` (a cons list), whose indexed access is O(n).
This PR removes `RDDConversions` and uses `RowEncoder` to convert each `Row` to an `InternalRow`.

The test of `PrunedScanSuite` for 2000 columns and 20k rows takes 409 seconds before this PR, and 361 seconds after.

## How was this patch tested?

Test case of `PrunedScanSuite`

Closes apache#23262 from eatoncys/toarray.

Authored-by: 10129659 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>