-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22570][SQL] Avoid to create a lot of global variables by using a local variable with allocation of an object in generated code #19797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I do not know whether to stop reusing an object here is good or not.
@cloud-fan @maropu @viirya I would like to hear your opinions.
|
Test build #84113 has finished for PR 19797 at commit
|
|
Test build #84115 has finished for PR 19797 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we create a new wrapper every time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would work well, too. We may have some overhead to create a small object IntWrapper and collect it at GC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't GC work very well in this case? My concern is, if there is no significant performance benefit, reusing global variables may be too hacky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. If we do not reuse global variables, we can use only a local variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we worry about name collision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you worry about name collision among different methods? Since the lifetime of this object is very short, I intentionally use the same name for the same type (IntWrapper or LongWrapper) to reuse the object.
We could use different names among different methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a help method to CodegenContext? Like reuseOrAddMutableState?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This don't really simplify the code since you still have the else block below. It complicates the codes here actually. So I think it is better to be unchanged.
|
Test build #84124 has finished for PR 19797 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any specific reason why here we are not following the approach used for IntWrapper and LongWrapper above? ie. defining it as a local variable, instead of reusing the same global object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is my question. In this case, we may see larger size of an object.
Which one do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I understand only now your comment there. Honestly I may have a partial vision of this then please correct me if I am wrong, but my preference is for local variable for two reasons:
- in the case we have a lot of methods and we have inner classes containing the methods, using a variable of the outer class adds (at least) a constant pool entry, while using a local variable doesn't (as per my understanding);
- since we reinitialize every time the object with new assignments, I think we are not saving any GC (we are reusing only the pointer basically, which I don't think is a big problem).
What do you think?
|
Test build #84133 has finished for PR 19797 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not do this, it's too hacky and error-prone. We should just create the object every time, or create global variable every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I will drop this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still create an object array every time here, which doesn't help with the GC. I think we can just create a local variable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we can just create the string buffer every time, the overhead is small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we may fold global variables to arrays, and these tests become useless. I think a more reliable way is: check the number of global variables in CodegenContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate this thought?
Do you want to this new check in test cases? Or do you mean we will remove these tests for global variables since we will have new check in another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like this, we have a codegen context, and do codegen. After that, instead of compiling the code to make sure it doesn't fail, we can just check the size of ctx.mutableStates to confirm that we don't add global variables during codegen
|
Test build #84314 has finished for PR 19797 at commit
|
|
Test build #84324 has finished for PR 19797 at commit
|
| checkEvaluation(nonNullExpr, "num-num", row1) | ||
| } | ||
|
|
||
| test("SPARK-22570: should not create a lot of instance variables") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RegExpReplace should not create a lot of global variables
| (1 to N).map(_ => expr.genCode(ctx).code) | ||
| // four global variables (lastRegex, pattern, lastReplacement, and lastReplacementInUTF8) | ||
| // are always required | ||
| assert(ctx.mutableStates.length == 4 * N) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify it to
val ctx = new CodegenContext
RegExpReplace(Literal("100"), Literal("(\\d+)"), Literal("num")).genCode(ctx)
// ...
assert(ctx.mutableStates.length == 4)
|
|
||
| test("SPARK-22570: should not create a lot of instance variables") { | ||
| val ctx = new CodegenContext | ||
| (1 to 60000).map(i => CreateArray(Seq(Literal(s"$i"))).genCode(ctx).code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why loop?
| comparePlans(Optimizer execute query, expected) | ||
| } | ||
|
|
||
| test("SPARK-22570: should not create a lot of instance variables") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: CreateArray should not create a lot of global variables
| (1 to N).map(i => StructField(s"i$i", LongType)).toArray) | ||
| val input2 = Row.fromSeq((1 to N).map(i => i.toString)) | ||
| val output2 = Row.fromSeq((1 to N).map(i => i.toLong)) | ||
| checkEvaluation(cast(Literal.create(input2, from2), to2), output2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also simplify this test like the other 2?
| test("SPARK-22570: Cast should not create a lot of instance variables") { | ||
| val ctx = new CodegenContext | ||
| cast("1", IntegerType).genCode(ctx).code | ||
| cast("2", LongType).genCode(ctx).code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need to call code
|
|
||
| test("SPARK-22570: CreateArray should not create a lot of global variables") { | ||
| val ctx = new CodegenContext | ||
| CreateArray(Seq(Literal(1))).genCode(ctx).code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
|
LGTM |
| checkEvaluation(cast(Literal.create(inputOuter, fromOuter), toOuter), outputOuter) | ||
| } | ||
|
|
||
| test("SPARK-22570: Cast should not create a lot of instance variables") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: instance variables -> global variables. To match with other two added tests.
| import org.apache.spark.SparkFunSuite | ||
| import org.apache.spark.sql.AnalysisException | ||
| import org.apache.spark.sql.catalyst.dsl.expressions._ | ||
| import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, CodegenContext, CodeGenerator} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CodeAndComment and CodeGenerator are not used.
| import org.apache.spark.sql.catalyst.dsl.expressions._ | ||
| import org.apache.spark.sql.catalyst.dsl.plans._ | ||
| import org.apache.spark.sql.catalyst.expressions._ | ||
| import org.apache.spark.sql.catalyst.expressions.codegen.{CodeAndComment, CodegenContext, CodeGenerator} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
|
LGTM |
|
Test build #84337 has finished for PR 19797 at commit
|
|
Test build #84341 has finished for PR 19797 at commit
|
|
thanks, merging to master! |
What changes were proposed in this pull request?
This PR reduces # of global variables in generated code by replacing a global variable with a local variable with an allocation of an object every time. When a lot of global variables were generated, the generated code may meet 64K constant pool limit.
This PR reduces # of generated global variables in the following three operations:
Castwith String to primitive byte/short/int/longRegExpReplaceCreateArrayI intentionally leave this part. This is because this variable keeps a class that is dynamically generated. In other word, it is not possible to reuse one class.
How was this patch tested?
Added test cases