[SPARK-25674][FOLLOW-UP] Update the stats for each ColumnarBatch #22731
Conversation
// don't need to run this `if` for every record.
val preNumRecordsRead = inputMetrics.recordsRead
if (nextElement.isInstanceOf[ColumnarBatch]) {
  incTaskInputMetricsBytesRead()
I see, so always update when processing ColumnarBatch, but use the previous logic otherwise. That seems OK. It should still address the original problem.
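(Editorial note: a minimal, self-contained Scala sketch of the pattern being discussed — update the bytes-read stat on every ColumnarBatch, but keep the old periodic update for row-based scans. The class names, the `updateBytesRead` callback, and the 1000-record interval are simplified stand-ins, not the actual Spark internals.)

```scala
// Stand-in for Spark's InputMetrics, for illustration only.
class InputMetrics {
  var recordsRead: Long = 0L
  def incRecordsRead(n: Long): Unit = recordsRead += n
}

// Stand-in for org.apache.spark.sql.vectorized.ColumnarBatch.
final case class ColumnarBatch(numRows: Int)

object MetricsUpdateSketch {
  val UpdateIntervalRecords = 1000 // illustrative; Spark uses a similar periodic interval

  def onNextElement(elem: Any, metrics: InputMetrics)(updateBytesRead: () => Unit): Unit =
    elem match {
      case batch: ColumnarBatch =>
        // Columnar scan: one update per batch is cheap (default batch size is 4096 rows).
        updateBytesRead()
        metrics.incRecordsRead(batch.numRows)
      case _ =>
        // Row-based scan: keep the original behavior and only refresh bytesRead periodically.
        if (metrics.recordsRead % UpdateIntervalRecords == 0) {
          updateBytesRead()
        }
        metrics.incRecordsRead(1)
    }
}
```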
... I guess the only possible drawback is that if the number of records in a ColumnarBatch is pretty small, then this could cause it to update bytes read a lot more frequently than before. But if the number of records is large (>100) then this won't matter.
4096 is the default batch size for the batch readers in both ORC and Parquet. If users set the conf to a much smaller number, they will face a perf regression due to the extra overhead in many places. I do not think end users will do this.
Makes sense. In that case the behavior should be the same before and after this change, so it's fine either way.
Considering that the default value of spark.sql.parquet.columnarReaderBatchSize is 4096, this change is better.
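(Editorial note: for reference, a sketch of how those batch sizes are configured. 4096 is the default for both the Parquet and ORC vectorized readers; the exact config keys shown are as of Spark 2.4, so double-check them for your version.)

```scala
import org.apache.spark.sql.SparkSession

// Shrinking the batch size well below the 4096 default would make the per-batch
// metrics update (and other per-batch work) fire more often.
val spark = SparkSession.builder()
  .appName("columnar-batch-size-example")
  .master("local[*]")
  .config("spark.sql.parquet.columnarReaderBatchSize", "4096") // default
  .config("spark.sql.orc.columnarReaderBatchSize", "4096")     // default
  .getOrCreate()
```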
Test build #97405 has finished for PR 22731 at commit
Test build #97406 has finished for PR 22731 at commit
LGTM, merging to master!
I'm going to merge this back to 2.3, as I had merged the original change back to 2.3.
## What changes were proposed in this pull request?

This PR is a follow-up of #22594. This alternative can avoid the unneeded computation in the hot code path.

- For row-based scan, we keep the original way.
- For the columnar scan, we just need to update the stats after each batch.

## How was this patch tested?

N/A

Closes #22731 from gatorsmile/udpateStatsFileScanRDD.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4cee191)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

This PR is a follow-up of apache#22594. This alternative can avoid the unneeded computation in the hot code path.

- For row-based scan, we keep the original way.
- For the columnar scan, we just need to update the stats after each batch.

## How was this patch tested?

N/A

Closes apache#22731 from gatorsmile/udpateStatsFileScanRDD.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR is a follow-up of #22594. This alternative can avoid the unneeded computation in the hot code path.

- For row-based scan, we keep the original way.
- For the columnar scan, we just need to update the stats after each batch.
How was this patch tested?
N/A
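(Editorial note: not part of the PR, but one way to observe the bytesRead/recordsRead values this change keeps up to date is a SparkListener. A rough sketch; the input path is a placeholder.)

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object ObserveInputMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("observe-input-metrics")
      .getOrCreate()

    // Print the per-task input metrics as tasks finish.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val in = taskEnd.taskMetrics.inputMetrics
        println(s"task ${taskEnd.taskInfo.taskId}: " +
          s"bytesRead=${in.bytesRead}, recordsRead=${in.recordsRead}")
      }
    })

    // Placeholder path: any Parquet/ORC file read through FileScanRDD will do.
    spark.read.parquet("/path/to/some.parquet").count()
    spark.stop()
  }
}
```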