@gatorsmile
Member

What changes were proposed in this pull request?

This PR is a follow-up of #22594. This alternative avoids the unneeded computation in the hot code path.

  • For the row-based scan, we keep the original behavior.
  • For the columnar scan, we only need to update the stats after each batch.
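
The two cases above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code of this patch: `Batch`, `Metrics`, `refreshBytesRead`, `record`, `scan`, and the interval of 1000 records are hypothetical stand-ins for Spark's `ColumnarBatch`, its task input metrics, and its row-based update interval.

```scala
object InputMetricsSketch {
  // Hypothetical stand-in for a columnar batch carrying several rows.
  final case class Batch(numRows: Int)

  // Hypothetical stand-in for the task input metrics.
  final class Metrics {
    var recordsRead: Long = 0L
    // Counts how often the (comparatively costly) bytes-read refresh ran.
    var bytesReadUpdates: Long = 0L
    def refreshBytesRead(): Unit = bytesReadUpdates += 1
  }

  // Assumed refresh interval for the row-based path (an assumption of this sketch).
  val UpdateIntervalRecords: Long = 1000L

  // Account for one element returned by the scan iterator.
  def record(element: Any, metrics: Metrics): Unit = element match {
    case b: Batch =>
      // Columnar path: refresh once per batch, then count every row in the batch.
      metrics.refreshBytesRead()
      metrics.recordsRead += b.numRows
    case _ =>
      // Row-based path: refresh only every UpdateIntervalRecords records.
      if (metrics.recordsRead % UpdateIntervalRecords == 0) metrics.refreshBytesRead()
      metrics.recordsRead += 1
  }

  // Drain an iterator of rows or batches and return the resulting metrics.
  def scan(elements: Iterator[Any]): Metrics = {
    val m = new Metrics
    elements.foreach(record(_, m))
    m
  }
}
```

For example, `InputMetricsSketch.scan(Iterator.fill(3)(InputMetricsSketch.Batch(4096)))` refreshes the bytes-read counter once per batch, while a row-based scan of the same size refreshes it only at each interval boundary.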

How was this patch tested?

N/A

@gatorsmile
Member Author

cc @10110346 @srowen @cloud-fan

// don't need to run this `if` for every record.
val preNumRecordsRead = inputMetrics.recordsRead
if (nextElement.isInstanceOf[ColumnarBatch]) {
  incTaskInputMetricsBytesRead()
Member

I see, so always update when processing ColumnarBatch, but use the previous logic otherwise. That seems OK. It should still address the original problem.

Member

... I guess the only possible drawback is that if the number of records in a ColumnarBatch is pretty small, this could update bytes read a lot more frequently than before. But if the number of records is large (>100), this won't matter.

Member Author

4096 is the default batch size of the batch reader in both ORC and Parquet. If users set the conf to a much smaller number, they will face a perf regression due to the extra overhead in many places. I do not think end users will do this.
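
To put rough numbers on the batch-size concern, the sketch below counts how many bytes-read updates a scan would trigger when one update happens per group of records. The helper `updates` and the row-based interval of 1000 records are assumptions made for illustration, not values taken from this patch.

```scala
object UpdateFrequencySketch {
  // Number of updates when one update happens per `recordsPerUpdate` records
  // (ceiling division, so a final partial group still counts as one update).
  def updates(totalRecords: Long, recordsPerUpdate: Long): Long = {
    require(recordsPerUpdate > 0, "recordsPerUpdate must be positive")
    (totalRecords + recordsPerUpdate - 1) / recordsPerUpdate
  }

  def main(args: Array[String]): Unit = {
    val total = 1000000L
    // Columnar scan with the default batch size mentioned above.
    println(s"batch size 4096: ${updates(total, 4096L)} updates")
    // Columnar scan with a deliberately tiny batch size.
    println(s"batch size 10:   ${updates(total, 10L)} updates")
    // Row-based scan with an assumed interval of 1000 records.
    println(s"interval 1000:   ${updates(total, 1000L)} updates")
  }
}
```

Under these assumptions, a batch size of 4096 updates less often than the row-based path does, and only a very small batch size makes the columnar path update noticeably more often.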

Member

Makes sense. In this case the behavior should be the same before and after this change, so it's fine, too.

Contributor

Considering that the default value of "spark.sql.parquet.columnarReaderBatchSize" is 4096, this change is better.

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97405 has finished for PR 22731 at commit 8731588.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2018

Test build #97406 has finished for PR 22731 at commit 7c3fd54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 4cee191 Oct 16, 2018
@srowen
Member

srowen commented Oct 16, 2018

I'm going to merge this back to 2.3, as I had merged the original change back to 2.3.

asfgit pushed a commit that referenced this pull request Oct 16, 2018
## What changes were proposed in this pull request?
This PR is a follow-up of #22594 . This alternative can avoid the unneeded computation in the hot code path.

- For row-based scan, we keep the original way.
- For the columnar scan, we just need to update the stats after each batch.

## How was this patch tested?
N/A

Closes #22731 from gatorsmile/udpateStatsFileScanRDD.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4cee191)
Signed-off-by: Sean Owen <[email protected]>
asfgit pushed a commit that referenced this pull request Oct 16, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019