[SPARK-15494][SQL] encoder code cleanup #13269
Conversation
Force-pushed from 18fb49a to 3f51e4d.
Test build #59165 has finished for PR 13269 at commit
Test build #59167 has finished for PR 13269 at commit
Is it better to use the full name like keyEncoder?
Put this if into the pattern match, to reduce one indent level.
Test build #59435 has finished for PR 13269 at commit
Where do we pass in an existing analyzer?
in Dataset
One thing to note is that SimpleAnalyzer uses case-sensitive resolution, and it's hard-coded, while Analyzer is configurable and uses case-insensitive resolution by default.
As we discussed offline, this PR also enables case-insensitive encoder resolution. It would be nice to add a test case for it. Basically something like this:

case class A(a: String)

val data = Seq(
  "{ 'A': 'foo' }",
  "{ 'A': 'bar' }"
)
val df1 = spark.read.json(sc.parallelize(data))
df1.printSchema()
// root
// |-- A: string (nullable = true)

val ds1 = df1.as[A]
ds1.printSchema()
// root
// |-- a: string (nullable = true)
Are we going to break this PR into multiple smaller PRs?
#13402 is merged, and I have one more PR to send.
Test build #59769 has finished for PR 13269 at commit
Test build #59772 has finished for PR 13269 at commit
So we are still using BoundReference for serializer expressions?
yea, I mentioned it in the PR description
Oh I see.
Test build #59789 has finished for PR 13269 at commit
Is this better?

val ordinals = deserializer.collect {
  case GetColumnByOrdinal(ordinal, _) => ordinal
}
ordinals.reduceOption(_ max _).foreach { maxOrdinal =>
  if (maxOrdinal != inputs.length - 1) {
    fail(inputs.toStructType, maxOrdinal)
  }
}
Actually we should also check that each ordinal from 0 to inputs.length - 1 appears in the deserializer expression:

val ordinals = deserializer.collect {
  case GetColumnByOrdinal(ordinal, _) => ordinal
}.distinct.sorted
if (ordinals.nonEmpty && ordinals != (0 until inputs.length)) {
  fail(inputs.toStructType, ordinals.max)
}
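The completeness check suggested above can be illustrated with a standalone sketch. This is a toy reduction of the logic (names like `OrdinalCheck` and `validate` are hypothetical, not Spark code): given the ordinals collected from `GetColumnByOrdinal` nodes, every column ordinal from 0 to n - 1 must appear, or resolution should fail.

```scala
// Toy sketch of the ordinal-completeness check discussed above.
// `ordinals` stands in for the ordinals collected from GetColumnByOrdinal
// nodes in a deserializer expression; `numInputs` is the input column count.
object OrdinalCheck {
  def validate(ordinals: Seq[Int], numInputs: Int): Boolean = {
    val distinctSorted = ordinals.distinct.sorted
    // Empty is vacuously fine; otherwise the ordinals must be exactly 0 until n.
    // Range is a Seq, so element-wise equality with a sorted Seq[Int] works.
    distinctSorted.isEmpty || distinctSorted == (0 until numInputs)
  }

  def main(args: Array[String]): Unit = {
    println(OrdinalCheck.validate(Seq(0, 1, 2, 1), 3)) // true: all ordinals covered
    println(OrdinalCheck.validate(Seq(0, 2), 3))       // false: ordinal 1 is missing
  }
}
```

Note how the `distinct.sorted` comparison catches both a too-small maximum ordinal and a gap in the middle, which the earlier `reduceOption(_ max _)` version would miss.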
Test build #59863 has finished for PR 13269 at commit
This if expression can be simplified to:

exprToOrdinals.getOrElseUpdate(g.child, ArrayBuffer.empty[Int]) += g.ordinal
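The `getOrElseUpdate` idiom suggested above replaces an explicit "if the key is absent, insert an empty buffer, then append" branch with a single call. A minimal self-contained sketch (the `GroupOrdinals` wrapper and the string keys are illustrative stand-ins for the child expressions and `g.ordinal` values in the PR):

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object GroupOrdinals {
  // Group ordinals by a key, appending to a per-key buffer that is
  // lazily created on first access via getOrElseUpdate.
  def group(pairs: Seq[(String, Int)]): mutable.Map[String, ArrayBuffer[Int]] = {
    val exprToOrdinals = mutable.HashMap.empty[String, ArrayBuffer[Int]]
    for ((child, ordinal) <- pairs) {
      exprToOrdinals.getOrElseUpdate(child, ArrayBuffer.empty[Int]) += ordinal
    }
    exprToOrdinals
  }

  def main(args: Array[String]): Unit = {
    println(GroupOrdinals.group(Seq(("a", 0), ("a", 2), ("b", 1))))
  }
}
```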
Test build #59904 has finished for PR 13269 at commit
Test build #59906 has finished for PR 13269 at commit
Just rebased this branch.
Test build #59920 has finished for PR 13269 at commit
Merging to master and branch-2.0.
## What changes were proposed in this pull request?

Our encoder framework has evolved a lot; this PR tries to clean up the code to make it more readable and to emphasize the concept that an encoder should be used as a container of serde expressions.

1. Move validation logic to the analyzer instead of the encoder.
2. Only have a `resolveAndBind` method in the encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
4. Using `BoundReference` to represent an unresolved field in a deserializer expression is kind of weird, so this PR adds a `GetColumnByOrdinal` for this purpose. (Serializer expressions still use `BoundReference`; we can replace it with `GetColumnByOrdinal` in follow-ups.)

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #13269 from cloud-fan/clean-encoder.

(cherry picked from commit 190ff27)
Signed-off-by: Cheng Lian <[email protected]>
So there will be a follow-up for replacing BoundReference?
Yeah, but it may not happen before 2.0. It needs some more refactoring of the object operator execution model. For example, the serializer in
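As a toy illustration of the consolidated life cycle described in the PR description (one `resolveAndBind` step instead of separate `resolve` and `bind`, with case-insensitive resolution), here is a minimal sketch. `ToyEncoder` and its fields are hypothetical stand-ins, not Spark's actual `ExpressionEncoder` API:

```scala
// Toy model: a single resolveAndBind() takes an unresolved encoder straight
// to a bound one, resolving field names case-insensitively against the input
// columns, as discussed in this PR. All names here are illustrative only.
case class ToyEncoder(schema: Seq[String], bound: Boolean = false) {
  def resolveAndBind(inputColumns: Seq[String]): ToyEncoder = {
    // Case-insensitive resolution: each schema field must match some input column.
    val ok = schema.forall(f => inputColumns.exists(_.equalsIgnoreCase(f)))
    require(ok, s"cannot resolve $schema against $inputColumns")
    copy(bound = true)
  }
}

object ToyEncoderDemo {
  def main(args: Array[String]): Unit = {
    // Field "a" resolves against column "A", mirroring the df.as[A] example above.
    println(ToyEncoder(Seq("a")).resolveAndBind(Seq("A")).bound)
  }
}
```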