
Conversation

@yaooqinn
Member

What changes were proposed in this pull request?

This PR makes the encode/decode functions report coding errors instead of producing mojibake for unmappable characters. Take select encode('渭城朝雨浥轻尘', 'US-ASCII') as an example.

Before this PR,

???????

After this PR,

org.apache.spark.SparkRuntimeException
{
  "errorClass" : "MALFORMED_CHARACTER_CODING",
  "sqlState" : "22000",
  "messageParameters" : {
    "charset" : "US-ASCII",
    "function" : "`encode`"
  }
}
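
For reference, the new behavior can also be reproduced through the Scala API. A minimal sketch, assuming an active SparkSession named spark:

import org.apache.spark.SparkRuntimeException

try {
  spark.sql("SELECT encode('渭城朝雨浥轻尘', 'US-ASCII')").collect()
} catch {
  case e: SparkRuntimeException =>
    // [MALFORMED_CHARACTER_CODING] ... charset: US-ASCII, function: `encode`
    println(e.getMessage)
}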

Why are the changes needed?

Improve data quality: silently emitting mojibake corrupts data, while raising an explicit error surfaces the problem to the user.

Does this PR introduce any user-facing change?

Yes.

When spark.sql.legacy.codingErrorAction is set to true, the encode/decode functions replace unmappable characters with mojibake instead of reporting coding errors, restoring the pre-PR behavior.
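
A minimal sketch of opting back into the old behavior, again assuming an active SparkSession named spark:

spark.conf.set("spark.sql.legacy.codingErrorAction", "true")
// With the legacy conf on, unmappable characters are silently replaced: ???????
spark.sql("SELECT encode('渭城朝雨浥轻尘', 'US-ASCII')").show()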

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Jun 19, 2024
@yaooqinn
Member Author

Hi @cloud-fan, I have addressed your comments. The expressions are now replaced at runtime via StaticInvoke, and their string representations no longer contain the legacy flags.
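
For readers unfamiliar with the pattern: "replaced at runtime via StaticInvoke" means the expression delegates its work to a static method call through Catalyst's StaticInvoke node instead of emitting its own generated code. A rough sketch of the idea, in which the helper class, method name, and argument list are assumptions rather than the exact ones in the patch:

import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionImplUtils}
import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
import org.apache.spark.sql.types.BinaryType

// Hypothetical: builds a replacement node that calls a static `encode` helper.
// The legacy flag travels as an ordinary argument, which is why it no longer
// appears in the expression's string representation.
def encodeReplacement(str: Expression, charset: Expression, legacyFlag: Expression): Expression =
  StaticInvoke(
    classOf[ExpressionImplUtils], // assumed helper class
    BinaryType,
    "encode",                     // assumed static method name
    Seq(str, charset, legacyFlag))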

yaooqinn requested a review from cloud-fan June 24, 2024 03:35
test("SPARK-22543: split large if expressions into blocks due to JVM code size limit") {
var strExpr: Expression = Literal("abc")
for (_ <- 1 to 150) {
strExpr = StringDecode(Encode(strExpr, "utf-8"), "utf-8")
Contributor

Can we use a different expression for testing? The codegen size is greatly decreased after using StaticInvoke in Encode.

Contributor

e.g. StringTrim

Member Author

Nice catch!
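
The suggestion amounts to building the deeply nested expression from StringTrim instead, since trimming still goes through regular codegen. A sketch of what the adjusted loop could look like (the actual test change may differ):

var strExpr: Expression = Literal("abc")
for (_ <- 1 to 150) {
  strExpr = StringTrim(strExpr)
}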

"instead of reporting coding errors.")
.version("4.0.0")
.booleanConf
.createWithDefault(false)
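
The snippet above is the tail of the new SQLConf entry. The full definition plausibly follows the standard buildConf pattern; the val name and doc wording below are reconstructed, not verbatim:

val LEGACY_CODING_ERROR_ACTION = buildConf("spark.sql.legacy.codingErrorAction")
  .internal()
  .doc("When true, encode/decode functions replace unmappable characters with mojibake " +
    "instead of reporting coding errors.")
  .version("4.0.0")
  .booleanConf
  .createWithDefault(false)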
Member

I wonder if it should be a fallback conf to ANSI.

Member Author

The reasons I'd like to make it independent of ANSI are:

  • Part of what ANSI mode implies is Hive incompatibility;
  • Hive also reports coding errors, so porting this behavior from Hive was a mistake in the first place;
  • These functions are not defined by ANSI;
  • The error behaviors are likewise not found in ANSI.

Taken together, these reasons indicate that the old behavior is a legacy trait of Spark itself.
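
For context, the alternative being discussed, a fallback conf, would take its default from another entry rather than a fixed value. A sketch of the mechanism with a deliberately hypothetical val name, using SQLConf's fallbackConf builder:

// Hypothetical: the conf would default to whatever spark.sql.ansi.enabled
// resolves to, instead of the independent createWithDefault(false) chosen here.
val HYPOTHETICAL_CODING_ERROR_ACTION = buildConf("spark.sql.legacy.codingErrorAction")
  .fallbackConf(SQLConf.ANSI_ENABLED)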

yaooqinn closed this in fb5697d Jun 24, 2024
@yaooqinn
Member Author

Merged to master.

Thank you @cloud-fan @HyukjinKwon for the help.
