[SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation. #21931
Conversation
@heary-cao, thanks! I am a bot who has found some folks who might be able to help with the review: @rxin, @cloud-fan and @yhuai
What does the benchmark result suggest? The result should be
@maropu, The test results show that we can make the capacity of the fast hash map a configuration parameter. Currently the capacity of our fast hash map is related to the length of the recorded data, so I'm not sure what the default value should be, but it is unreasonable to hard-code it in CodeGen as a fixed value (`int capacity = 1 << 16;`).
cc @maropu |
We can see the following code at L226. If a user specifies a 2^n value (e.g. 1024), it works correctly. What happens if a user specifies a non-2^n value (e.g. 127)?
idx = (idx + 1) & (numBuckets - 1);
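A minimal standalone sketch (hypothetical bucket counts, not Spark code) of why the masked probe above only works when `numBuckets` is a power of two: with a non-2^n count, `numBuckets - 1` is not an all-ones bit mask, so the probe fails to cycle through the slots.

```java
public class ProbeMaskDemo {
    // Count how many distinct slots the linear probe visits before repeating.
    static int distinctSlots(int numBuckets) {
        boolean[] seen = new boolean[numBuckets];
        int idx = 0, count = 0;
        while (!seen[idx]) {
            seen[idx] = true;
            count++;
            idx = (idx + 1) & (numBuckets - 1); // the probing step from the review
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(distinctSlots(1024)); // 2^n: visits all 1024 slots
        System.out.println(distinctSlots(127));  // non-2^n: the probe gets stuck after 1 slot
    }
}
```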
@kiszk, Thank you for your suggestion. I have updated it. Can you review it again when you have some time? Thanks.
Do we need to accept these small values, e.g., 2^1, 2^2, ...? I think these are meaningless...
This name looks too long. How about `spark.sql.codegen.aggregate.map.row.capacitybit` or something similar?
Can we describe something about this value (e.g. that it is a bit count, not the capacity itself)? Strictly speaking, the actual numBuckets is determined by loadFactor, too.
Can we also describe the default value?
Does this work when we set
@kiszk, I'm not sure what the maximum should be set to. A size of 1G is the maximum value accepted by numBuckets; of course, 1G buckets corresponds to 8G of memory.
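As a rough standalone check of that arithmetic (the 8 bytes per bucket is an assumption about the slot layout, not taken from the PR):

```java
public class BucketMemoryEstimate {
    public static void main(String[] args) {
        long numBuckets = 1L << 30;   // 1G buckets, the stated maximum
        long bytesPerBucket = 8L;     // assumed 8-byte slot per bucket
        long totalBytes = numBuckets * bytesPerBucket;
        System.out.println((totalBytes >> 30) + " GB"); // prints "8 GB"
    }
}
```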
nit: `1 >> 16` -> `1 << 16`
nit: we need to keep each line to at most 100 characters. IIUC, this line is longer than 100.
`spark.sql.codegen.aggregate.fastHashMap.capacityBit`?
@kiszk, @cloud-fan
LGTM, cc @cloud-fan @hvanhovell
`the bit` -> `The bit is`.
"fasthash = 20"?
"fasthash = 16"?
Minor comments. LGTM.
Benchmark("Capacity for fast hash aggregate")
ok to test
why add it?
I'm sorry, updated it. Thanks.
Test build #95107 has finished for PR 21931 at commit
I don't see why we need to add a benchmark; we are just making the capacity configurable.
Can you explain how to calculate this valid range?
I am not sure about the minimum value. The initial configuration is 1, but I agree with @maropu's suggestion: values like 2^1 or 2^2 are meaningless.
The maximum is 30, because an int has 31 value bits, and the actual numBuckets in the fast hash map is also determined by loadFactor, which is 0.5. The code is:
private double loadFactor = 0.5;
private int numBuckets = (int) (capacity / loadFactor);
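A runnable sketch (values assumed for illustration) of how a capacity bit translates into numBuckets via the loadFactor above, and why large bit values approach the int limit:

```java
public class NumBucketsDemo {
    public static void main(String[] args) {
        double loadFactor = 0.5;
        int capacityBit = 16;                          // the default discussed above
        int capacity = 1 << capacityBit;               // 65536
        int numBuckets = (int) (capacity / loadFactor);
        System.out.println(numBuckets);                // prints 131072, i.e. 1 << 17
        // With capacityBit = 31, capacity / loadFactor would be 2^32, which no
        // longer fits in an int -- hence the cap on the valid range of the bit.
    }
}
```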
Test build #95154 has finished for PR 21931 at commit
Test build #95157 has finished for PR 21931 at commit
Test build #95190 has finished for PR 21931 at commit
LGTM
LGTM again
thanks, merging to master!
… to configure the capacity of fast aggregation.

## What changes were proposed in this pull request?

This PR adds a configuration parameter to configure the capacity of fast aggregation.

Performance comparison:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
Aggregate w multiple keys:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------
fasthash = default                     5612 / 5882          3.7         267.6       1.0X
fasthash = config                      3586 / 3595          5.8         171.0       1.6X
```

## How was this patch tested?

The existing test cases.

Closes apache#21931 from heary-cao/FastHashCapacity.

Authored-by: caoxuewen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>