[SPARK-15494][SQL] encoder code cleanup #13269
Conversation
Force-pushed from 18fb49a to 3f51e4d.
Test build #59165 has finished for PR 13269 at commit
Test build #59167 has finished for PR 13269 at commit
Is it better to use the full name like keyEncoder?
Put this if into the pattern match, to reduce one indent level.
Test build #59435 has finished for PR 13269 at commit
Where do we pass in an existing analyzer?
in Dataset
One thing to note is that SimpleAnalyzer uses case-sensitive resolution, and it's hard-coded, while Analyzer is configurable and uses case-insensitive resolution by default.
As we discussed offline, this PR also enables case-insensitive encoder resolution. It would be nice to add a test case for it. Basically something like this:

case class A(a: String)

val data = Seq(
  "{ 'A': 'foo' }",
  "{ 'A': 'bar' }"
)
val df1 = spark.read.json(sc.parallelize(data))
df1.printSchema()
// root
// |-- A: string (nullable = true)

val ds1 = df1.as[A]
ds1.printSchema()
// root
// |-- a: string (nullable = true)
Are we going to break this PR into multiple smaller PRs?
#13402 is merged, and I have one more PR to send.
Test build #59769 has finished for PR 13269 at commit
Test build #59772 has finished for PR 13269 at commit
So we are still using BoundReference for serializer expressions?
yea, I mentioned it in the PR description
Oh I see.
Test build #59789 has finished for PR 13269 at commit
Is this better?

val ordinals = deserializer.collect {
  case GetColumnByOrdinal(ordinal, _) => ordinal
}
ordinals.reduceOption(_ max _).foreach { maxOrdinal =>
  if (maxOrdinal != inputs.length - 1) {
    fail(inputs.toStructType, maxOrdinal)
  }
}
Actually we should also check that each ordinal from 0 to inputs.length - 1 appears in the deserializer expression:

val ordinals = deserializer.collect {
  case GetColumnByOrdinal(ordinal, _) => ordinal
}.distinct.sorted
if (ordinals.nonEmpty && ordinals != (0 until inputs.length)) {
  fail(inputs.toStructType, ordinals.max)
}
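The completeness check suggested above can be illustrated with a standalone sketch. This is a toy reduction of the logic (names like `OrdinalCheck` and `validate` are hypothetical, not Spark code): given the ordinals collected from `GetColumnByOrdinal` nodes, every column ordinal from 0 to n - 1 must appear, or resolution should fail.

```scala
// Toy sketch of the ordinal-completeness check discussed above.
// `ordinals` stands in for the ordinals collected from GetColumnByOrdinal
// nodes in a deserializer expression; `numInputs` is the input column count.
object OrdinalCheck {
  def validate(ordinals: Seq[Int], numInputs: Int): Boolean = {
    val distinctSorted = ordinals.distinct.sorted
    // Empty is vacuously fine; otherwise the ordinals must be exactly 0 until n.
    // Range is a Seq, so element-wise equality with a sorted Seq[Int] works.
    distinctSorted.isEmpty || distinctSorted == (0 until numInputs)
  }

  def main(args: Array[String]): Unit = {
    println(OrdinalCheck.validate(Seq(0, 1, 2, 1), 3)) // true: all ordinals covered
    println(OrdinalCheck.validate(Seq(0, 2), 3))       // false: ordinal 1 is missing
  }
}
```

Note how the `distinct.sorted` comparison catches both a too-small maximum ordinal and a gap in the middle, which the earlier `reduceOption(_ max _)` version would miss.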
Test build #59863 has finished for PR 13269 at commit
This if expression can be simplified to:

exprToOrdinals.getOrElseUpdate(g.child, ArrayBuffer.empty[Int]) += g.ordinal
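The `getOrElseUpdate` idiom suggested above replaces an explicit "if the key is absent, insert an empty buffer, then append" branch with a single call. A minimal self-contained sketch (the `GroupOrdinals` wrapper and the string keys are illustrative stand-ins for the child expressions and `g.ordinal` values in the PR):

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object GroupOrdinals {
  // Group ordinals by a key, appending to a per-key buffer that is
  // lazily created on first access via getOrElseUpdate.
  def group(pairs: Seq[(String, Int)]): mutable.Map[String, ArrayBuffer[Int]] = {
    val exprToOrdinals = mutable.HashMap.empty[String, ArrayBuffer[Int]]
    for ((child, ordinal) <- pairs) {
      exprToOrdinals.getOrElseUpdate(child, ArrayBuffer.empty[Int]) += ordinal
    }
    exprToOrdinals
  }

  def main(args: Array[String]): Unit = {
    println(GroupOrdinals.group(Seq(("a", 0), ("a", 2), ("b", 1))))
  }
}
```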
Test build #59904 has finished for PR 13269 at commit
Test build #59906 has finished for PR 13269 at commit
Just rebased this branch.
Test build #59920 has finished for PR 13269 at commit
Merging to master and branch-2.0.
## What changes were proposed in this pull request?

Our encoder framework has evolved a lot; this PR tries to clean up the code to make it more readable and to emphasize the concept that an encoder should be used as a container of serde expressions.

1. Move validation logic to the analyzer instead of the encoder.
2. Only have a `resolveAndBind` method in the encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
4. Using `BoundReference` to represent an unresolved field in a deserializer expression is kind of weird, so this PR adds a `GetColumnByOrdinal` for this purpose. (Serializer expressions still use `BoundReference`; we can replace it with `GetColumnByOrdinal` in follow-ups.)

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #13269 from cloud-fan/clean-encoder.

(cherry picked from commit 190ff27)
Signed-off-by: Cheng Lian <[email protected]>
So there will be a follow-up for replacing BoundReference?
Yeah, but it may not happen before 2.0. It needs some more refactoring of the object operator execution model. For example, the serializer in
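As a toy illustration of the consolidated life cycle described in the PR description (one `resolveAndBind` step instead of separate `resolve` and `bind`, with case-insensitive resolution), here is a minimal sketch. `ToyEncoder` and its fields are hypothetical stand-ins, not Spark's actual `ExpressionEncoder` API:

```scala
// Toy model: a single resolveAndBind() takes an unresolved encoder straight
// to a bound one, resolving field names case-insensitively against the input
// columns, as discussed in this PR. All names here are illustrative only.
case class ToyEncoder(schema: Seq[String], bound: Boolean = false) {
  def resolveAndBind(inputColumns: Seq[String]): ToyEncoder = {
    // Case-insensitive resolution: each schema field must match some input column.
    val ok = schema.forall(f => inputColumns.exists(_.equalsIgnoreCase(f)))
    require(ok, s"cannot resolve $schema against $inputColumns")
    copy(bound = true)
  }
}

object ToyEncoderDemo {
  def main(args: Array[String]): Unit = {
    // Field "a" resolves against column "A", mirroring the df.as[A] example above.
    println(ToyEncoder(Seq("a")).resolveAndBind(Seq("A")).bound)
  }
}
```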