Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented May 24, 2016

What changes were proposed in this pull request?

Our encoder framework has evolved a lot. This PR tries to clean up the code to make it more readable and to emphasize that an encoder should be used as a container of serde expressions.

  1. Move validation logic into the analyzer instead of the encoder.
  2. Keep only a resolveAndBind method in the encoder instead of resolve and bind, as we no longer have the encoder life cycle concept.
  3. Dataset doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
  4. Using BoundReference to represent an unresolved field in a deserializer expression is kind of weird, so this PR adds GetColumnByOrdinal for this purpose. (Serializer expressions still use BoundReference; we can replace it with GetColumnByOrdinal in follow-ups.)
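The idea behind point 4 can be sketched in a self-contained, simplified form. This is not the actual Catalyst code: the real GetColumnByOrdinal extends LeafExpression with full eval/codegen machinery, and the `bind` helper below is a hypothetical stand-in for what the analyzer does when it resolves such a reference.

```scala
// Simplified sketch: GetColumnByOrdinal marks an *unresolved* input column,
// which analysis later turns into a resolved BoundReference.
sealed trait DataType
case object StringType extends DataType

sealed trait Expression
// Unresolved: "give me input column #ordinal"; resolved later by the analyzer.
case class GetColumnByOrdinal(ordinal: Int, dataType: DataType) extends Expression
// Resolved: a direct reference into the input row, ready for evaluation.
case class BoundReference(ordinal: Int, dataType: DataType, nullable: Boolean) extends Expression

object EncoderSketch {
  // Hypothetical analyzer step: rewrite unresolved lookups into bound references.
  def bind(e: Expression): Expression = e match {
    case GetColumnByOrdinal(ord, dt) => BoundReference(ord, dt, nullable = true)
    case other                       => other
  }
}
```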

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

@cloud-fan cloud-fan force-pushed the clean-encoder branch 2 times, most recently from 18fb49a to 3f51e4d Compare May 24, 2016 00:11
@SparkQA

SparkQA commented May 24, 2016

Test build #59165 has finished for PR 13269 at commit 73e9c1a.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 24, 2016

Test build #59167 has finished for PR 13269 at commit 3f51e4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Is it better to use the full name like keyEncoder?

Contributor Author

Put this if into the pattern match, to reduce one indent level.

@SparkQA

SparkQA commented May 27, 2016

Test build #59435 has finished for PR 13269 at commit 13fed35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class GetColumnByOrdinal(ordinal: Int, dataType: DataType) extends LeafExpression

Contributor

Where do we pass in an existing analyzer?

Contributor Author

in Dataset

Contributor

One thing is that SimpleAnalyzer uses case-sensitive resolution, and it's hard-coded, while Analyzer is configurable and uses case-insensitive resolution by default.
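As a minimal sketch of the difference (hypothetical object and method names; Catalyst models a resolver as a `(String, String) => Boolean` function along these lines):

```scala
// Sketch of case-sensitive vs. case-insensitive name resolution.
object ResolutionSketch {
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver   = _ == _
  val caseInsensitive: Resolver = _.equalsIgnoreCase(_)

  // Find the schema field matching a requested name under a given resolver.
  def resolve(fields: Seq[String], name: String, r: Resolver): Option[String] =
    fields.find(r(_, name))
}
```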

@liancheng
Contributor

As we discussed offline, this PR also enables case insensitive encoder resolution. Would be nice to add a test case for it. Basically something like this:

case class A(a: String)

val data = Seq(
  "{ 'A': 'foo' }",
  "{ 'A': 'bar' }"
)

val df1 = spark.read.json(sc.parallelize(data))
df1.printSchema()
// root
//  |-- A: string (nullable = true)

val ds1 = df1.as[A]
ds1.printSchema()
// root
//  |-- a: string (nullable = true)

@yhuai
Contributor

yhuai commented May 31, 2016

Are we going to break this PR into multiple smaller PRs?

@cloud-fan
Contributor Author

#13402 is merged, and I have one more PR to send.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59769 has finished for PR 13269 at commit e321e4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59772 has finished for PR 13269 at commit c294b3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

So we are still using BoundReference for serializer expressions?

Contributor Author

yea, I mentioned it in the PR description

Contributor

Oh I see.

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59789 has finished for PR 13269 at commit b86bd01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Is this better?

val ordinals = deserializer.collect {
  case GetColumnByOrdinal(ordinal, _) => ordinal
}

ordinals.reduceOption(_ max _).foreach { maxOrdinal =>
  if (maxOrdinal != inputs.length - 1) {
    fail(inputs.toStructType, maxOrdinal)
  }
}

Contributor

Actually we should also check that each ordinal from 0 to inputs.length - 1 appears in the deserializer expression:

val ordinals = deserializer.collect {
  case GetColumnByOrdinal(ordinal, _) => ordinal
}.distinct.sorted

if (ordinals.nonEmpty && ordinals != (0 until inputs.length)) {
  fail(inputs.toStructType, ordinals.max)
}

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59863 has finished for PR 13269 at commit efa9616.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This if expression can be simplified to:

exprToOrdinals.getOrElseUpdate(g.child, ArrayBuffer.empty[Int]) += g.ordinal

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59904 has finished for PR 13269 at commit fb26103.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59906 has finished for PR 13269 at commit 6c793a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Just rebased this branch.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59920 has finished for PR 13269 at commit efe0cd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Merging to master and branch-2.0.

asfgit pushed a commit that referenced this pull request Jun 3, 2016

Author: Wenchen Fan <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #13269 from cloud-fan/clean-encoder.

(cherry picked from commit 190ff27)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 190ff27 Jun 3, 2016
@liancheng
Contributor

So there will be a follow-up for replacing BoundReference in serializer expressions with GetColumnByOrdinal, right?

@cloud-fan
Contributor Author

Yea, but it may not happen before 2.0. It needs some more refactoring of the object operator execution model. For example, the serializer in AppendColumns has no corresponding attribute; its input is an object we got from the given lambda function.
