Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Nov 22, 2017

What changes were proposed in this pull request?

This PR reduces # of global variables in generated code by replacing a global variable with a local variable with an allocation of an object every time. When a lot of global variables were generated, the generated code may meet 64K constant pool limit.
This PR reduces # of generated global variables in the following three operations:

  • Cast with String to primitive byte/short/int/long
  • RegExpReplace
  • CreateArray

I intentionally leave this part. This is because this variable keeps a class that is dynamically generated. In other word, it is not possible to reuse one class.

How was this patch tested?

Added test cases

Copy link
Member Author

@kiszk kiszk Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I do not know whether to stop reusing an object here is good or not.
@cloud-fan @maropu @viirya I would like to hear your opinions.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84113 has finished for PR 19797 at commit 8ba2f59.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 22, 2017

Test build #84115 has finished for PR 19797 at commit 7a49c41.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we create a new wrapper every time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would work well, too. We may have some overhead to create a small object IntWrapper and collect it at GC.

Copy link
Contributor

@cloud-fan cloud-fan Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't GC work very well in this case? My concern is, if there is no significant performance benefit, reusing global variables may be too hacky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. If we do not reuse global variables, we can use only a local variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we worry about name collision?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you worry about name collision among different methods? Since the lifetime of this object is very short, I intentionally use the same name for the same type (IntWrapper or LongWrapper) to reuse the object.
We could use different names among different methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a help method to CodegenContext? Like reuseOrAddMutableState?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This don't really simplify the code since you still have the else block below. It complicates the codes here actually. So I think it is better to be unchanged.

@SparkQA
Copy link

SparkQA commented Nov 23, 2017

Test build #84124 has finished for PR 19797 at commit 4d9657a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any specific reason why here we are not following the approach used for IntWrapper and LongWrapper above? ie. defining it as a local variable, instead of reusing the same global object?

Copy link
Member Author

@kiszk kiszk Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my question. In this case, we may see larger size of an object.
Which one do you suggest?

Copy link
Contributor

@mgaido91 mgaido91 Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I understand only now your comment there. Honestly I may have a partial vision of this then please correct me if I am wrong, but my preference is for local variable for two reasons:

  • in the case we have a lot of methods and we have inner classes containing the methods, using a variable of the outer class adds (at least) a constant pool entry, while using a local variable doesn't (as per my understanding);
  • since we reinitialize every time the object with new assignments, I think we are not saving any GC (we are reusing only the pointer basically, which I don't think is a big problem).

What do you think?

@SparkQA
Copy link

SparkQA commented Nov 23, 2017

Test build #84133 has finished for PR 19797 at commit 39f1b6d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not do this, it's too hacky and error-prone. We should just create the object every time, or create global variable every time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I will drop this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still create an object array every time here, which doesn't help with the GC. I think we can just create a local variable here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it local variable, I would like to use a new API at #19821.
After merging #19821, I will commit new version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we can just create the string buffer every time, the overhead is small.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future we may fold global variables to arrays, and these tests become useless. I think a more reliable way is: check the number of global variables in CodegenContext

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate this thought?
Do you want to this new check in test cases? Or do you mean we will remove these tests for global variables since we will have new check in another PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like this, we have a codegen context, and do codegen. After that, instead of compiling the code to make sure it doesn't fail, we can just check the size of ctx.mutableStates to confirm that we don't add global variables during codegen

@kiszk kiszk changed the title [SPARK-22570][SQL] Avoid to create a lot of global variables to reuse an object in generated code [SPARK-22570][SQL] Avoid to create a lot of global variables by using a local variable with allocation of an object in generated code Nov 29, 2017
@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84314 has finished for PR 19797 at commit 36a330d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 30, 2017

Test build #84324 has finished for PR 19797 at commit 3b8459d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkEvaluation(nonNullExpr, "num-num", row1)
}

test("SPARK-22570: should not create a lot of instance variables") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: RegExpReplace should not create a lot of global variables

(1 to N).map(_ => expr.genCode(ctx).code)
// four global variables (lastRegex, pattern, lastReplacement, and lastReplacementInUTF8)
// are always required
assert(ctx.mutableStates.length == 4 * N)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify it to

val ctx = new CodegenContext
RegExpReplace(Literal("100"), Literal("(\\d+)"), Literal("num")).genCode(ctx)
// ...
assert(ctx.mutableStates.length == 4)


test("SPARK-22570: should not create a lot of instance variables") {
val ctx = new CodegenContext
(1 to 60000).map(i => CreateArray(Seq(Literal(s"$i"))).genCode(ctx).code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why loop?

comparePlans(Optimizer execute query, expected)
}

test("SPARK-22570: should not create a lot of instance variables") {
Copy link
Contributor

@cloud-fan cloud-fan Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: CreateArray should not create a lot of global variables

(1 to N).map(i => StructField(s"i$i", LongType)).toArray)
val input2 = Row.fromSeq((1 to N).map(i => i.toString))
val output2 = Row.fromSeq((1 to N).map(i => i.toLong))
checkEvaluation(cast(Literal.create(input2, from2), to2), output2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also simplify this test like the other 2?

test("SPARK-22570: Cast should not create a lot of instance variables") {
val ctx = new CodegenContext
cast("1", IntegerType).genCode(ctx).code
cast("2", LongType).genCode(ctx).code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need to call code


test("SPARK-22570: CreateArray should not create a lot of global variables") {
val ctx = new CodegenContext
CreateArray(Seq(Literal(1))).genCode(ctx).code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@cloud-fan
Copy link
Contributor

LGTM

checkEvaluation(cast(Literal.create(inputOuter, fromOuter), toOuter), outputOuter)
}

test("SPARK-22570: Cast should not create a lot of instance variables") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instance variables -> global variables. To match with other two added tests.

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, CodegenContext, CodeGenerator}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeAndComment and CodeGenerator are not used.

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, CodegenContext, CodeGenerator}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@viirya
Copy link
Member

viirya commented Nov 30, 2017

LGTM

@SparkQA
Copy link

SparkQA commented Nov 30, 2017

Test build #84337 has finished for PR 19797 at commit a455ac3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 30, 2017

Test build #84341 has finished for PR 19797 at commit 747e215.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants