Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Nov 16, 2018

What changes were proposed in this pull request?

When doing typed aggregation on a Dataset, for struct key type, the key attribute is named as "key". But for non-struct type, the key attribute is named as "value". This key attribute should also be named as "key" for non-struct type.

How was this patch tested?

Added test.

@viirya
Copy link
Member Author

viirya commented Nov 16, 2018

cc @cloud-fan

@cloud-fan
Copy link
Contributor

makes sense to me. This is a behavior change right? Shall we write a migration guide?

@SparkQA
Copy link

SparkQA commented Nov 16, 2018

Test build #98891 has finished for PR 23054 at commit c7bbe91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Nov 16, 2018

Ok. Let me update migration guide.

@SparkQA
Copy link

SparkQA commented Nov 16, 2018

Test build #98902 has finished for PR 23054 at commit 42e32ad.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Nov 16, 2018

Test build #98907 has finished for PR 23054 at commit 42e32ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 17, 2018

Test build #98961 has finished for PR 23054 at commit 2b697dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Nov 17, 2018

We should add a “legacy” flag in case somebody’s workload gets broken by this. We can remove the legacy flag in a future release.

@viirya
Copy link
Member Author

viirya commented Nov 18, 2018

Ok. I will add a flag. Thanks @rxin

@SparkQA
Copy link

SparkQA commented Nov 18, 2018

Test build #98977 has finished for PR 23054 at commit 6e3c37a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Nov 18, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Nov 18, 2018

Test build #98978 has finished for PR 23054 at commit 6e3c37a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Nov 18, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Nov 18, 2018

Test build #98976 has finished for PR 23054 at commit 4f85876.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 18, 2018

Test build #98981 has finished for PR 23054 at commit 6e3c37a.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Nov 18, 2018

BTW what does the non-primitive types look like? Do they get flattened, or is there a struct ?

@viirya
Copy link
Member Author

viirya commented Nov 19, 2018

For struct types there is a struct named "key".


- The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is atomic type, e.g. int, string, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.atomicKeyAttributeGroupByKey` with a default value of `false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that, only struct type key has the key alias. So here we should say: if the key is non-struct type, e.g. int, string, array, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. More accurate.

.createWithDefault(false)

val LEGACY_ATOMIC_KEY_ATTRIBUTE_GROUP_BY_KEY =
buildConf("spark.sql.legacy.atomicKeyAttributeGroupByKey")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spark.sql.legacy.dataset.aliasNonStructGroupingKey?

@viirya viirya changed the title [SPARK-26085][SQL] Key attribute of primitive type under typed aggregation should be named as "key" too [SPARK-26085][SQL] Key attribute of non-struct type under typed aggregation should be named as "key" too Nov 19, 2018
val keyColumn = if (!kExprEnc.isSerializedAsStruct) {
assert(groupingAttributes.length == 1)
groupingAttributes.head
if (SQLConf.get.aliasNonStructGroupingKey) {
Copy link
Contributor

@cloud-fan cloud-fan Nov 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should do the alias when config is true...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, don't we want to have "key" attribute and only have old "value" attribute when we turn on legacy config?

@SparkQA
Copy link

SparkQA commented Nov 19, 2018

Test build #98988 has finished for PR 23054 at commit b5cfda4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.createWithDefault(false)

val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY =
buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKey")
Copy link
Contributor

@rxin rxin Nov 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe aliasNonStructGroupingKeyAsValue, and default to false.

Then we can remove this in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. That makes sense. Thanks.

@SparkQA
Copy link

SparkQA commented Nov 19, 2018

Test build #99009 has finished for PR 23054 at commit 3930a35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 19, 2018

Test build #99010 has finished for PR 23054 at commit f58e93e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

sorry it conflicts, can you resolve it? I think it's ready to go

@viirya
Copy link
Member Author

viirya commented Nov 20, 2018

@cloud-fan Yea, it's resolved. Thanks.

@SparkQA
Copy link

SparkQA commented Nov 20, 2018

Test build #99035 has finished for PR 23054 at commit 0ffdb4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2018

Test build #99096 has finished for PR 23054 at commit d784142.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Nov 21, 2018

Test build #99097 has finished for PR 23054 at commit d784142.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Nov 21, 2018

retest this please...

@SparkQA
Copy link

SparkQA commented Nov 21, 2018

Test build #99103 has finished for PR 23054 at commit d784142.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

hmmm it conflicts again...

@viirya
Copy link
Member Author

viirya commented Nov 21, 2018

yea, resolved again. :)

@SparkQA
Copy link

SparkQA commented Nov 22, 2018

Test build #99146 has finished for PR 23054 at commit 70cab8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RuleSummary(
  • class QueryPlanningTracker
  • class QueryExecution(

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in ab2eafb Nov 22, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…gation should be named as "key" too

## What changes were proposed in this pull request?

When doing typed aggregation on a Dataset, for struct key type, the key attribute is named as "key". But for non-struct type, the key attribute is named as "value". This key attribute should also be named as "key" for non-struct type.

## How was this patch tested?

Added test.

Closes apache#23054 from viirya/SPARK-26085.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@viirya viirya deleted the SPARK-26085 branch December 27, 2023 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants