[SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files #20937
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -363,11 +363,11 @@ class JacksonParser( | |
| throw BadRecordException(() => recordLiteral(record), () => None, e) | ||
| case e: CharConversionException if options.encoding.isEmpty => | ||
| val msg = | ||
| """Failed to parse a character. Charset was detected automatically. | ||
| |You might want to set it explicitly via the charset option like: | ||
| | .option("charset", "UTF-8") | ||
| |Example of supported charsets: | ||
| | UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE | ||
| """Failed to parse a character. Encoding was detected automatically. | ||
| |You might want to set it explicitly via the encoding option like: | ||
| | .option("encoding", "UTF-8") | ||
|
||
| |Example of supported encodings: | ||
| | UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE | ||
| |""".stripMargin + e.getMessage | ||
| throw new CharConversionException(msg) | ||
|
||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2070,9 +2070,9 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |
| // Read | ||
| val data = | ||
| s""" | ||
| | {"f": | ||
| |"a", "f0": 1}$lineSep{"f": | ||
| | | ||
| | {"f": | ||
| |"a", "f0": 1}$lineSep{"f": | ||
| | | ||
| |"c", "f0": 2}$lineSep{"f": "d", "f0": 3} | ||
| """.stripMargin | ||
| val dataWithTrailingLineSep = s"$data$lineSep" | ||
|
|
@@ -2140,9 +2140,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |
| .option("encoding", "UTF-16") | ||
| .json(testFile(fileName)) | ||
|
|
||
| checkAnswer(jsonDF, Seq( | ||
| Row("Chris", "Baird"), Row("Doug", "Rood") | ||
| )) | ||
| checkAnswer(jsonDF, Seq(Row("Chris", "Baird"), Row("Doug", "Rood"))) | ||
| } | ||
|
|
||
| test("SPARK-23723: multi-line json in UTF-32BE with BOM") { | ||
|
|
@@ -2207,10 +2205,9 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |
| } | ||
|
|
||
| def checkEncoding( | ||
| expectedEncoding: String, | ||
| pathToJsonFiles: String, | ||
| expectedContent: String | ||
| ): Unit = { | ||
| expectedEncoding: String, | ||
| pathToJsonFiles: String, | ||
| expectedContent: String): Unit = { | ||
|
Member
I think it should be per https://github.com/databricks/scala-style-guide#spacing-and-indentation or if it fits per databricks/scala-style-guide#58 (comment). Not a big deal.
||
| val jsonFiles = new File(pathToJsonFiles) | ||
| .listFiles() | ||
| .filter(_.isFile) | ||
|
|
@@ -2288,13 +2285,8 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData { | |
| } | ||
| } | ||
|
|
||
| def checkReadJson( | ||
| lineSep: String, | ||
| encodingOption: String, | ||
| encoding: String, | ||
| inferSchema: Boolean, | ||
| runId: Int | ||
| ): Unit = { | ||
| def checkReadJson(lineSep: String, encodingOption: String, encoding: String, | ||
| inferSchema: Boolean, runId: Int): Unit = { | ||
| test(s"SPARK-23724: checks reading json in ${encoding} #${runId}") { | ||
| val lineSepInBytes = { | ||
| if (lineSep.startsWith("x")) { | ||
|
||
|
|
||
I think automatic detection is true only when multiline is enabled. We can just describe it in the documentation and reword this message, or even just remove it.
This message was added explicitly to tell our customers how to resolve issues like https://issues.apache.org/jira/browse/SPARK-23094 . In our experience, describing that in the docs is not enough. Customers will just create support tickets, and we will have to spend time figuring out the root causes. The tip can help customers solve the problem on their side. /cc @brkyvz
My point is that automatic detection is true only when multiline is enabled, and the message looks like it's always true. I don't think we should expose an incomplete (or accidental) functionality in any case. We already found many holes, right?
OK, speaking about this concrete exception handling: the exception with the message is thrown ONLY when `options.encoding.isEmpty` is `true`. That means `encoding` is not set and the actual encoding of the file was autodetected. The `msg` says exactly that: `Encoding was detected automatically.`

Maybe `encoding` was detected correctly but the file contains a wrong char. In that case, the first sentence says so: `Failed to parse a character`. The same could happen if you set `encoding` explicitly, because you cannot guarantee that inputs are always correct.

A wrong char can occur in a UTF-8 file read with `multiline = false` and in a UTF-16LE file read with `multiline = true`.

My point is that mentioning the `multiline` option in the error message doesn't help the user solve the issue. A possible solution is to set `encoding` explicitly, which is what the message actually says.
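The guard described in this comment can be sketched in isolation. This is a minimal sketch, not the actual `JacksonParser` code: `ReaderOptions`, `EncodingHint`, and `withHint` are hypothetical names introduced for illustration, and only the hint text and the `options.encoding.isEmpty` condition come from the diff in this PR.

```scala
import java.io.CharConversionException

// Hypothetical stand-in for the JSON reader options; only `encoding` matters here.
final case class ReaderOptions(encoding: Option[String])

object EncodingHint {
  // Hint text as it appears in the diff above.
  private val hint: String =
    """Failed to parse a character. Encoding was detected automatically.
      |You might want to set it explicitly via the encoding option like:
      |  .option("encoding", "UTF-8")
      |Example of supported encodings:
      |  UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE
      |""".stripMargin

  // Runs `parse`. A CharConversionException gets the hint prepended only when
  // no encoding was set; with an explicit encoding the guard does not match
  // and the original exception propagates unchanged.
  def withHint[A](options: ReaderOptions)(parse: => A): A =
    try parse
    catch {
      case e: CharConversionException if options.encoding.isEmpty =>
        throw new CharConversionException(hint + e.getMessage)
    }
}
```

In this sketch the hint depends only on whether `encoding` was set, which mirrors the argument in the comment: the message says nothing about `multiline` or about whether detection was correct.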
I don't think `Encoding was detected automatically` is quite correct. It might not help the user solve the issue, and it gives less correct information. They could think it detects the encoding correctly regardless of the `multiline` option.

Think about this scenario: users somehow get this exception and read `Failed to parse a character. Encoding was detected automatically.`. What would they think? I would think the file somehow failed to be read, but that the encoding in the file was detected correctly and automatically regardless of other options.

In my experience, it's annoying to debug encoding-related stuff. It would be nicer if we gave correct information as much as we can.
I am saying let's document and clarify the condition that the automatic encoding detection feature officially works only when `multiLine` is enabled, which is true.
It is absolutely correct. If `encoding` is not set, it is detected automatically by Jackson. Look at the condition `if options.encoding.isEmpty =>`.

It gives absolutely correct information.

The message DOESN'T say that `encoding` was detected correctly.

They will look at the proposed solution `You might want to set it explicitly via the encoding option like` and will set `encoding`.

It could be true even if `encoding` is set correctly.

I don't know why you decided that. I see nothing about `encoding` correctness in the message.

What is your suggestion for the error message?

I agree, let's document that, though it is not related to this PR. This PR doesn't change the behavior of encoding auto-detection, and from my point of view it must not change it. If you want to restrict the encoding auto-detection mechanism somehow, please create a separate PR. We will discuss separately what kind of customer apps it would break.
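To illustrate why the message points users at an explicit `encoding`, here is a minimal sketch using plain JVM charset decoding. It does not go through Spark or Jackson; `ExplicitEncoding`, `decodeStrict`, and the sample record are hypothetical and chosen only for illustration.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

object ExplicitEncoding {
  // Strict decode: malformed input is reported as an error instead of being
  // silently replaced with U+FFFD.
  def decodeStrict(bytes: Array[Byte], charset: Charset): Either[CharacterCodingException, String] = {
    val decoder = charset.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Right(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch { case e: CharacterCodingException => Left(e) }
  }

  // Hypothetical record with one non-ASCII character (U+00E9), stored as UTF-16LE without a BOM.
  val record: String = "{\"name\": \"Zo\u00e9\"}"
  val utf16le: Array[Byte] = record.getBytes(StandardCharsets.UTF_16LE)
}
```

Under these assumptions, `decodeStrict(utf16le, StandardCharsets.UTF_16LE)` returns the original record, while decoding the same bytes as UTF-8 yields a `Left`, because the byte `0xE9` is not followed by valid UTF-8 continuation bytes. Naming the charset explicitly removes the guesswork, which is the remedy the error message proposes.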