[SPARK-29033][SQL][WIP] Always use UnsafeRow-based version of CreateNamedStruct #25745
Conversation
case (result: Float, expected: Float) =>
  if (expected.isNaN) result.isNaN else expected == result
case (result: Row, expected: InternalRow) => result.toSeq == expected.toSeq(result.schema)
case (result: Seq[InternalRow], expected: Seq[InternalRow]) =>
This change was needed because we can't do direct equals() comparison between UnsafeRow and other row classes. After this PR's changes, the "SPARK-14793: split wide struct creation into blocks due to JVM code size limit" test case in CodeGenerationSuite was failing because the new code was producing UnsafeRow but the test code was comparing against GenericInternalRow. In the old code, this comparison between sequences of rows was happening in the default case _ => below, but that case doesn't work when the InternalRow implementations are mismatched.
I'm not sure whether this change-of-internal-row-format will have adverse consequences in other parts of the code.
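For context, here is a minimal, hedged sketch of the layout-agnostic comparison involved; the schema and row values below are made up for illustration, while UnsafeProjection and InternalRow.toSeq(schema) are existing Catalyst APIs:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical two-column schema, purely for illustration.
val schema = StructType(Seq(StructField("a", LongType), StructField("b", LongType)))

val generic: InternalRow = new GenericInternalRow(Array[Any](1L, 2L))
val unsafe = UnsafeProjection.create(schema).apply(generic) // same data, UnsafeRow layout

// Direct equality is implementation-specific: UnsafeRow.equals only matches other UnsafeRows.
assert(!unsafe.equals(generic))

// Comparing field-by-field via toSeq(schema) ignores the physical row format.
assert(unsafe.toSeq(schema) == generic.toSeq(schema))
```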
As a high-level illustration of why I think this might improve performance, compare the following before-and-after excerpts of the generated code (a hedged sketch of a query that produces comparable output follows the excerpts).

Before:

/* 021 */ public void init(int index, scala.collection.Iterator[] inputs) {
/* 022 */ partitionIndex = index;
/* 023 */ this.inputs = inputs;
/* 024 */
/* 025 */ range_taskContext_0 = TaskContext.get();
/* 026 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 027 */ range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 028 */ range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */ range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32);
/* 030 */ range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(range_mutableStateArray_0[2], 2);
/* 031 */
/* 032 */ }
/* 033 */
/* 034 */ private void project_doConsume_0(long project_expr_0_0) throws java.io.IOException {
/* 035 */ Object[] project_values_0 = new Object[2];
/* 036 */
/* 037 */ if (false) {
/* 038 */ project_values_0[0] = null;
/* 039 */ } else {
/* 040 */ project_values_0[0] = project_expr_0_0;
/* 041 */ }
/* 042 */
/* 043 */ if (false) {
/* 044 */ project_values_0[1] = null;
/* 045 */ } else {
/* 046 */ project_values_0[1] = project_expr_0_0;
/* 047 */ }
/* 048 */
/* 049 */ final InternalRow project_value_1 = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(project_values_0);
/* 050 */ project_values_0 = null;
/* 051 */ range_mutableStateArray_0[2].reset();
/* 052 */
/* 053 */ final InternalRow project_tmpInput_0 = project_value_1;
/* 054 */ if (project_tmpInput_0 instanceof UnsafeRow) {
/* 055 */ range_mutableStateArray_0[2].write(0, (UnsafeRow) project_tmpInput_0);
/* 056 */ } else {
/* 057 */ // Remember the current cursor so that we can calculate how many bytes are
/* 058 */ // written later.
/* 059 */ final int project_previousCursor_0 = range_mutableStateArray_0[2].cursor();
/* 060 */
/* 061 */ range_mutableStateArray_0[3].resetRowWriter();
/* 062 */
/* 063 */ range_mutableStateArray_0[3].write(0, (project_tmpInput_0.getLong(0)));
/* 064 */
/* 065 */ range_mutableStateArray_0[3].write(1, (project_tmpInput_0.getLong(1)));
/* 066 */
/* 067 */ range_mutableStateArray_0[2].setOffsetAndSizeFromPreviousCursor(0, project_previousCursor_0);
/* 068 */ }
/* 069 */ append((range_mutableStateArray_0[2].getRow()));
/* 070 */
/* 071 */ }

After:

/* 021 */ public void init(int index, scala.collection.Iterator[] inputs) {
/* 022 */ partitionIndex = index;
/* 023 */ this.inputs = inputs;
/* 024 */
/* 025 */ range_taskContext_0 = TaskContext.get();
/* 026 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 027 */ range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 028 */ range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */ range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 030 */ range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32);
/* 031 */ range_mutableStateArray_0[4] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(range_mutableStateArray_0[3], 2);
/* 032 */
/* 033 */ }
/* 034 */
/* 035 */ private void project_doConsume_0(long project_expr_0_0) throws java.io.IOException {
/* 036 */ range_mutableStateArray_0[2].reset();
/* 037 */
/* 038 */ range_mutableStateArray_0[2].write(0, project_expr_0_0);
/* 039 */
/* 040 */ range_mutableStateArray_0[2].write(1, project_expr_0_0);
/* 041 */ range_mutableStateArray_0[3].reset();
/* 042 */
/* 043 */ final InternalRow project_tmpInput_0 = (range_mutableStateArray_0[2].getRow());
/* 044 */ if (project_tmpInput_0 instanceof UnsafeRow) {
/* 045 */ range_mutableStateArray_0[3].write(0, (UnsafeRow) project_tmpInput_0);
/* 046 */ } else {
/* 047 */ // Remember the current cursor so that we can calculate how many bytes are
/* 048 */ // written later.
/* 049 */ final int project_previousCursor_0 = range_mutableStateArray_0[3].cursor();
/* 050 */
/* 051 */ range_mutableStateArray_0[4].resetRowWriter();
/* 052 */
/* 053 */ range_mutableStateArray_0[4].write(0, (project_tmpInput_0.getLong(0)));
/* 054 */
/* 055 */ range_mutableStateArray_0[4].write(1, (project_tmpInput_0.getLong(1)));
/* 056 */
/* 057 */ range_mutableStateArray_0[3].setOffsetAndSizeFromPreviousCursor(0, project_previousCursor_0);
/* 058 */ }
/* 059 */ append((range_mutableStateArray_0[3].getRow()));
/* 060 */
/* 061 */ }
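For reference, a hedged sketch of how comparable whole-stage-codegen output could be dumped from spark-shell (where a `spark` session is already in scope). The query shape is my guess at something matching the excerpts above; debugCodegen() comes from Spark's org.apache.spark.sql.execution.debug package:

```scala
import org.apache.spark.sql.execution.debug._ // adds debugCodegen() to Datasets
import org.apache.spark.sql.functions.struct
import spark.implicits._

// Assumed query shape: a two-field struct built twice from the same long column,
// roughly matching the project_expr_0_0 writes in the excerpts above.
val df = spark.range(10).select(struct($"id", $"id").as("s"))

// Prints the generated Java for each whole-stage-codegen subtree.
df.debugCodegen()
```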
Test build #110424 has finished for PR 25745 at commit
Retest this please.
While we're at it, there could be significant wins from eliminating the
 */
case class CreateNamedStructUnsafe(children: Seq[Expression]) extends CreateNamedStructLike {
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val eval = GenerateUnsafeProjection.createCode(ctx, valExprs)
Are there any types that GenerateUnsafeProjection doesn't support? Judging from GenerateUnsafeProjection.canSupport, it looks like there aren't.
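Not an answer to the canSupport question, but a quick, hedged spot check one could run at the SQL level (assuming an existing `spark` session): push a mix of atomic, array, map, and nested struct fields through named_struct, which exercises the struct-creation path discussed here.

```scala
// Hypothetical spot check, not exhaustive: mix several field types in one named_struct.
spark.sql("""
  SELECT named_struct(
    'i',      1,
    's',      'text',
    'arr',    array(1, 2, 3),
    'm',      map('k', 'v'),
    'nested', named_struct('d', CAST('2019-01-01' AS DATE))
  ) AS value
""").show(truncate = false)
```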
Test build #110643 has finished for PR 25745 at commit
Just for reference, we are considering doing the opposite: removing CreateNamedStructUnsafe, as it causes some issues with MapObjects in some cases. #26173
@HeartSaVioR, thanks for the link. I'm closing this PR and will resolve my JIRA as "Won't Fix".
@HeartSaVioR, if this improves performance in general, we might rather have to revert #26173 and take this instead, since apparently that PR does not have any user-facing change. Can you clarify it at #26173 (comment)?
WIP, pending benchmarking (to verify that it's actually faster) and time to think about possible corner-cases.

What changes were proposed in this pull request?
Spark 2.x has two separate implementations of the "create named struct" expression: regular CreateNamedStruct and CreateNamedStructUnsafe. The former produces GenericInternalRows, while the latter produces UnsafeRows. Both expressions extend the CreateNamedStructLike trait.

The "unsafe" version was added in SPARK-9373 / #7689 to support structs in GenerateUnsafeProjection (this was fairly early in the Tungsten effort, circa mid-2015 / Spark 1.5.x).

This PR changes Spark so the UnsafeRow-based codepath is always used and removes the GenericRow-based path. For ease of review, I've broken this into two commits:

1. The first commit modifies CreateNamedStruct to use the CreateNamedStructUnsafe implementation (producing UnsafeRow) and deletes CreateNamedStructUnsafe.
2. The second commit removes the CreateNamedStructLike trait (since at that point it only has a single implementation).

Why are the changes needed?
The old CreateNamedStruct code path allocated a fresh GenericInternalRow on every invocation, incurring object-allocation and primitive-boxing performance overheads.

I suspect that this can be a significant performance problem in code which uses Datasets: Spark's ExpressionEncoder uses CreateNamedStruct for serializing / deserializing case classes, so the GenericInternalRow performance problems can impact typed Dataset operations (e.g. .map()).

I spotted this while doing a deep-dive into Encoder-generated code.
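As a hedged illustration of the kind of typed Dataset code in question (the case class and pipeline below are hypothetical, written spark-shell style with a `spark` session assumed; they only show where per-row serialization happens):

```scala
import spark.implicits._

// Hypothetical case class; its encoder serializes every instance into a struct-valued row.
case class Point(x: Long, y: Long)

// Each record produced by .map() is re-serialized through Point's encoder,
// so the struct-creation codepath runs once per output row.
val evenCount = spark.range(1000000L)
  .map(i => Point(i, i * 2))
  .filter(_.x % 2 == 0)
  .count()
```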
I think there are also code-simplification benefits: if we only have a single implementation of CreateNamedStruct, then we remove the possibility of bugs in case code authors pattern-match on concrete classes instead of the CreateNamedStructLike trait.

Does this PR introduce any user-facing change?
There is no expected user-facing behavioral change.
However, there is a binary-compatibility-breaking change due to the deletion of CreateNamedStructUnsafe and CreateNamedStructLike. I suspect that these were originally intended to be internal / private classes, and I therefore propose that this is an acceptable breaking change in Spark 3.x (since it seems pretty unlikely that Spark users would be directly using those internal interfaces).

How was this patch tested?
I believe this is covered by Spark's existing tests.
CreateNamedStructUnsafe is already used in GenerateUnsafeProjection and thus already has some pre-existing test coverage.