Closed

105 commits
b2e92b4
Test for reading json in UTF-16 with BOM
MaxGekk Feb 11, 2018
cb2f27b
Use user's charset or autodetect it if the charset is not specified
MaxGekk Feb 11, 2018
0d45fd3
Added a type and a comment for charset
MaxGekk Feb 13, 2018
1fb9b32
Replacing the monadic chaining by matching because it is more readable
MaxGekk Feb 13, 2018
c3b04ee
Keeping the old method for backward compatibility
MaxGekk Feb 13, 2018
93d3879
testFile is moved into the test to make more local because it is used…
MaxGekk Feb 13, 2018
15798a1
Adding the charset as third parameter to the text method
MaxGekk Feb 13, 2018
cc05ce9
Removing whitespaces at the end of the line
MaxGekk Feb 13, 2018
74f2026
Fix the comment in javadoc style
MaxGekk Feb 13, 2018
4856b8e
Simplifying of the UTF-16 test
MaxGekk Feb 13, 2018
084f41f
A hint to the exception how to set the charset explicitly
MaxGekk Feb 15, 2018
31cd793
Fix for scala style checks
MaxGekk Feb 15, 2018
6eacd18
Run tests again
MaxGekk Feb 15, 2018
3b4a509
Improving of the exception message
MaxGekk Feb 15, 2018
cd1124e
Appended the original message to the exception
MaxGekk Feb 15, 2018
ebf5390
Multi-line reading of json file in utf-32
MaxGekk Feb 17, 2018
c5b6a35
Autodetect charset of jsons in the multiline mode
MaxGekk Feb 17, 2018
ef5e6c6
Test for reading a json in UTF-16LE in the multiline mode by using us…
MaxGekk Feb 17, 2018
f9b6ad1
Fix test: rename the test file - utf32be -> utf32BE
MaxGekk Feb 18, 2018
3b7714c
Fix code style
MaxGekk Feb 18, 2018
edb9167
Appending the create verb to the method for readability
MaxGekk Feb 18, 2018
5ba2881
Making the createParser as a separate private method
MaxGekk Feb 18, 2018
1509e10
Fix code style
MaxGekk Feb 18, 2018
e3184b3
Checks the charset option is supported
MaxGekk Feb 19, 2018
87d259c
Support charset as a parameter of the json method
MaxGekk Feb 19, 2018
76c1d08
Test for charset different from utf-8
MaxGekk Feb 19, 2018
88395b5
Description of the charset option of the json method
MaxGekk Feb 20, 2018
f2f8ae7
Minor changes in comments: added . at the end of a sentence
MaxGekk Feb 21, 2018
b451a03
Added a test for wrong charset name
MaxGekk Feb 21, 2018
c13c159
Testing that charset in any case is acceptable
MaxGekk Feb 21, 2018
1cb3ac0
Test: user specified wrong (but supported) charset
MaxGekk Feb 21, 2018
108e8e7
Set charset as an option
MaxGekk Feb 25, 2018
0d20cc6
Test: saving to json in UTF-32BE
MaxGekk Feb 23, 2018
54baf9f
Taking user's charset for saved json
MaxGekk Feb 23, 2018
1d50d94
Test: output charset is UTF-8 by default
MaxGekk Feb 23, 2018
bb53798
Changing the readJsonFiles method for readability
MaxGekk Mar 4, 2018
961b482
The test checks that json written by Spark can be read back
MaxGekk Mar 4, 2018
a794988
Adding the delimiter option encoded in base64
MaxGekk Feb 24, 2018
dccdaa2
Separator encoded as a sequence of bytes in hex
MaxGekk Feb 24, 2018
d0abab7
Refactoring: removed unused imports and renaming a parameter
MaxGekk Feb 24, 2018
6741796
The sep option is renamed to recordSeparator. The supported format is…
MaxGekk Mar 4, 2018
e4faae1
Renaming recordSeparator to recordDelimiter
MaxGekk Mar 18, 2018
01f4ef5
Comments for the recordDelimiter option
MaxGekk Mar 18, 2018
24cedb9
Support other formats of recordDelimiter
MaxGekk Mar 18, 2018
d40dda2
Checking different charsets and record delimiters
MaxGekk Mar 18, 2018
ad6496c
Renaming test's method to make it more readable
MaxGekk Mar 18, 2018
358863d
Test of reading json in different charsets and delimiters
MaxGekk Mar 18, 2018
7e5be5e
Fix inferring of csv schema for any charsets
MaxGekk Mar 18, 2018
d138d2d
Fix errors of scalastyle check
MaxGekk Mar 18, 2018
c26ef5d
Reserving format for regular expressions and concatenated json
MaxGekk Mar 22, 2018
5f0b069
Fix recordDelimiter tests
MaxGekk Mar 22, 2018
ef8248f
Additional cases are added to the delimiter test
MaxGekk Mar 22, 2018
2efac08
Renaming recordDelimiter to lineSeparator
MaxGekk Mar 22, 2018
b2020fa
Adding HyukjinKwon changes
MaxGekk Mar 22, 2018
f99c1e1
Revert lineSepInWrite back to lineSep
MaxGekk Mar 22, 2018
6d13d00
Merge remote-tracking branch 'origin/master' into json-line-sep
MaxGekk Mar 22, 2018
77112ef
Fix passing of the lineSeparator to HadoopFileLinesReader
MaxGekk Mar 22, 2018
d632706
Fix python style checking
MaxGekk Mar 23, 2018
bbff402
Fix text source tests and javadoc comments
MaxGekk Mar 23, 2018
3af996b
Merge branch 'json-charset' into json-charset-record-delimiter
MaxGekk Mar 27, 2018
8253811
Merge branch 'json-line-sep' into json-charset-record-delimiter
MaxGekk Mar 27, 2018
ab8210c
Getting UTF-8 as default charset for lineSep
MaxGekk Mar 27, 2018
7c6f115
Set charset different from UTF-8 in the test
MaxGekk Mar 27, 2018
f553b07
Fix for the charset test: charset wasn't specified
MaxGekk Mar 27, 2018
d6a07a1
Removing line leaved after merge
MaxGekk Mar 27, 2018
cb12ea3
Removing flexible format for lineSep
MaxGekk Mar 28, 2018
eb2965b
Adding ticket number to test titles
MaxGekk Mar 28, 2018
7a4865c
Making comments more precise
MaxGekk Mar 28, 2018
dbeb0c1
lineSep must be specified if charset is different from UTF-8
MaxGekk Mar 28, 2018
ac67020
Support encoding as a synonym for the charset option
MaxGekk Mar 29, 2018
d96b720
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk Mar 29, 2018
75f7bb6
Fix missing require and specifying field of internal row explicitly
MaxGekk Mar 29, 2018
d93dcdc
Making the doc generator happy
MaxGekk Mar 29, 2018
65b4b73
Making the encoding name as the primary name
MaxGekk Mar 31, 2018
6b52419
Blacklisting UTF-16 and UTF-32 in per-line mode
MaxGekk Mar 31, 2018
6116bac
Changes after code review
MaxGekk Mar 31, 2018
5383400
Renaming charset to encoding
MaxGekk Mar 31, 2018
1aeae3c
Changes requested by HyukjinKwon in the review
MaxGekk Apr 4, 2018
7e20891
Adding tests for SPARK-23094
MaxGekk Apr 4, 2018
0d3ed3c
Fix comments
MaxGekk Apr 5, 2018
5d5c295
Matching by encoding per each line is eliminated
MaxGekk Apr 6, 2018
e7be77d
Addressing Hyukjin's review comments
MaxGekk Apr 6, 2018
6bd841a
Fixes regarding to coding style
MaxGekk Apr 6, 2018
6a62679
Making lineSep as opt string
MaxGekk Apr 6, 2018
3b30ce0
Removing option name in a test
MaxGekk Apr 6, 2018
fcd0a21
Merge branch 'master' into json-encoding-line-sep
MaxGekk Apr 8, 2018
af71324
Addressing HyukjinKwon's review comments
MaxGekk Apr 8, 2018
76dbbed
Merge branch 'json-encoding-line-sep' of github.com:MaxGekk/spark-1 i…
MaxGekk Apr 8, 2018
3207e59
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk Apr 8, 2018
b817184
Making Scala style checker and compiler happy
MaxGekk Apr 8, 2018
15df9af
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk Apr 13, 2018
36253f4
Adressing Hyukjin Kwon's review comments
MaxGekk Apr 14, 2018
aa69559
Adding benchmarks for json reads
MaxGekk Apr 15, 2018
c35d5d1
Making Scala style checker happy
MaxGekk Apr 15, 2018
58fc5c6
Eliminate unneeded wrapping by ByteArrayInputStream per-line during s…
MaxGekk Apr 15, 2018
63b5894
Adding benchmarks for wide lines
MaxGekk Apr 15, 2018
1ace082
Making comments shorter
MaxGekk Apr 15, 2018
6c0df03
Removing empty line between spark's imports
MaxGekk Apr 15, 2018
b4c0d38
Creating a stream decoder with specific buffer size
MaxGekk Apr 15, 2018
f2a259f
Enable all JSON benchmarks
MaxGekk Apr 15, 2018
482b799
Addressing Hyukjin Kwon's review comments
MaxGekk Apr 22, 2018
a0ab98b
Addressing Wenchen Fan's review comments
MaxGekk Apr 23, 2018
a7be182
Merge branch 'json-encoding-line-sep' of github.com:MaxGekk/spark-1 i…
MaxGekk Apr 23, 2018
e0cebf4
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk Apr 27, 2018
d3d28aa
Addressing Hyukjin Kwon's review comments
MaxGekk Apr 28, 2018
Addressing Hyukjin's review comments
MaxGekk committed Apr 6, 2018
commit e7be77d52a1c13a8817eba086f25454c06981e6f
@@ -126,9 +126,4 @@ private[sql] class JSONOptions(
allowBackslashEscapingAnyCharacter)
factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
}

def getTextOptions: Map[String, String] = {
Map[String, String]() ++
encoding.map("encoding" -> _) ++ lineSeparator.map("lineSep" -> _)
}
}
@@ -363,11 +363,11 @@ class JacksonParser(
throw BadRecordException(() => recordLiteral(record), () => None, e)
case e: CharConversionException if options.encoding.isEmpty =>
val msg =
"""Failed to parse a character. Charset was detected automatically.
|You might want to set it explicitly via the charset option like:
| .option("charset", "UTF-8")
|Example of supported charsets:
| UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
"""Failed to parse a character. Encoding was detected automatically.
Member:
I think automatic detection is true only when multiline is enabled. We can just describe that in the documentation and reword this message, or even remove it entirely.

Member Author:
This message was added explicitly to tell our customers how to resolve issues like https://issues.apache.org/jira/browse/SPARK-23094. Describing that in the docs is not enough, in our experience: customers will just create support tickets, and we will have to spend time figuring out the root causes. The tip can help customers solve the problem on their side. /cc @brkyvz

HyukjinKwon (Member), Apr 8, 2018:

My point is that automatic detection is true only when multiline is enabled, while the message reads as if it were always true. I don't think we should expose incomplete (or accidental) functionality in any case. We already found many holes, right?

Member Author:
OK, speaking about this concrete exception handling: the exception with this message is thrown ONLY when options.encoding.isEmpty is true. That means the encoding was not set and the actual encoding of the file was autodetected. The message says exactly that: Encoding was detected automatically.

Maybe the encoding was detected correctly but the file contains a wrong char. In that case, the first sentence covers it: Failed to parse a character. The same could happen with an explicitly set encoding, because you cannot guarantee that inputs are always correct.

> I think automatic detection is true only when multiline is enabled.

A wrong char can occur in a UTF-8 file read with multiline = false just as in a UTF-16LE file read with multiline = true.

My point is that mentioning the multiline option in the error message doesn't help the user solve the issue. A possible solution is to set the encoding explicitly, which is exactly what the message suggests.

HyukjinKwon (Member), Apr 9, 2018:

I don't think Encoding was detected automatically is quite correct. It might not help the user solve the issue, and it gives less correct information: they could think it detects the encoding correctly regardless of the multiline option.

Think about this scenario: users somehow get this exception and read Failed to parse a character. Encoding was detected automatically. What would they think? I would think the file somehow failed to read, but that the encoding was detected correctly, automatically, regardless of other options.

It's annoying to debug encoding-related stuff, in my experience. It would be nicer if we gave the most correct information we can.

HyukjinKwon (Member), Apr 9, 2018:

I am saying let's officially document and clarify the condition that the automatic encoding detection feature applies only when multiLine is enabled, which is true.

Member Author:

> I don't think Encoding was detected automatically is quite correct.

It is absolutely correct. If the encoding is not set, it is detected automatically by Jackson. Look at the condition if options.encoding.isEmpty =>.

> It might not help user solve the issue but it gives less correct information.

It gives absolutely correct information.

> They could thought it detects encoding correctly regardless of multiline option.

The message DOESN'T say that the encoding was detected correctly.

> Think about this scenario: users somehow get this exception and read Failed to parse a character. Encoding was detected automatically.. What would they think?

They will look at the proposed solution, You might want to set it explicitly via the encoding option like, and will set the encoding.

> I would think somehow the file is somehow failed to read

That could be true even when the encoding is set correctly.

> but it looks detecting the encoding in the file correctly automatically

I don't know why you decided that. I see nothing about encoding correctness in the message.

> It's annoying to debug encoding related stuff in my experience. It would be nicer if we give the correct information as much as we can.

What is your suggestion for the error message?

> I am saying let's document the automatic encoding detection feature only for multiLine officially, which is true.

I agree, let's document that, though it is not related to this PR. This PR doesn't change the behavior of encoding auto-detection, and it must not change it, from my point of view. If you want to restrict the auto-detection mechanism somehow, please create a separate PR, and we will discuss separately what kinds of customer apps it would break.
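The auto-detection being debated here happens in Jackson, from the leading bytes of the stream. As a language-neutral illustration (not Spark's or Jackson's actual code, and Jackson's real bootstrapper also inspects null-byte patterns when no BOM is present), BOM-based detection can be sketched as:

```python
import codecs

# Order matters: the UTF-32LE BOM (FF FE 00 00) starts with the
# UTF-16LE BOM (FF FE), so 32-bit BOMs must be checked first.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

def detect_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON byte stream from its BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "UTF-8"  # no BOM: assume UTF-8, the JSON default
```

This also illustrates why a BOM-less non-UTF-8 file can be misread unless the user sets the encoding option explicitly, which is what the disputed error message recommends.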

|You might want to set it explicitly via the encoding option like:
| .option("encoding", "UTF-8")
Member:

ditto for prose ...

|Example of supported encodings:
| UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE
|""".stripMargin + e.getMessage
throw new CharConversionException(msg)
cloud-fan (Contributor), Apr 23, 2018:

This will lose the original stack trace; we should do something like:

val newException = new CharConversionException(msg)
newException.setStackTrace(e.getStackTrace)
throw newException

Contributor:

BTW we should also follow the existing rule and wrap the exception with BadRecordException. See the code above.
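The two review points above, keep the original stack trace and wrap the failure in a richer exception, have a direct analogue in other languages. A hedged Python sketch of the same idea (names are illustrative, not Spark's API), chaining the new exception to the original so no context is lost:

```python
class CharConversionError(Exception):
    """Stand-in for java.io.CharConversionException."""

def rethrow_with_hint(original: Exception, hint: str):
    # Chain the new exception to the original so the first stack
    # trace survives (Python's counterpart of setStackTrace / cause).
    raise CharConversionError(hint) from original

try:
    try:
        b"\xff\xfe\x00a".decode("utf-8")  # bytes that are not valid UTF-8
    except UnicodeDecodeError as e:
        rethrow_with_hint(
            e, "Failed to parse a character. "
               "Try setting the encoding option explicitly.")
except CharConversionError as err:
    # The original decode failure is still reachable as the cause.
    assert isinstance(err.__cause__, UnicodeDecodeError)
```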

}
@@ -111,15 +111,18 @@ object TextInputJsonDataSource extends JsonDataSource {
private def createBaseDataset(
sparkSession: SparkSession,
inputPaths: Seq[FileStatus],
parsedOptions: JSONOptions
): Dataset[String] = {
parsedOptions: JSONOptions): Dataset[String] = {
val paths = inputPaths.map(_.getPath.toString)
val textOptions = Map.empty[String, String] ++
parsedOptions.encoding.map("encoding" -> _) ++
parsedOptions.lineSeparator.map("lineSep" -> _)

sparkSession.baseRelationToDataFrame(
DataSource.apply(
sparkSession,
paths = paths,
className = classOf[TextFileFormat].getName,
options = parsedOptions.getTextOptions
options = textOptions
).resolveRelation(checkFilesExist = false))
.select("value").as(Encoders.STRING)
}
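The inlined textOptions above forwards an option to the underlying text source only when the user actually set it. The same pattern, sketched in Python with a hypothetical function name (the Scala original is `Map.empty ++ encoding.map(...) ++ lineSeparator.map(...)`):

```python
def text_options(encoding=None, line_sep=None):
    # Start from an empty map and add each entry only when the
    # corresponding option was actually supplied by the user.
    opts = {}
    if encoding is not None:
        opts["encoding"] = encoding
    if line_sep is not None:
        opts["lineSep"] = line_sep
    return opts
```

Unset options are simply absent from the map, so downstream defaults (UTF-8, newline detection) still apply.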
@@ -163,8 +166,7 @@ object MultiLineJsonDataSource extends JsonDataSource {
JsonInferSchema.infer[PortableDataStream](
sampled,
parsedOptions,
createParser(_, _, parsedOptions.encoding)
)
createParser(_, _, parsedOptions.encoding))
}

private def createBaseRdd(
@@ -155,7 +155,7 @@ private[json] class JsonOutputWriter(

private val encoding = options.encoding match {
case Some(charsetName) => Charset.forName(charsetName)
case _ => StandardCharsets.UTF_8
case None => StandardCharsets.UTF_8
}

private val writer = CodecStreams.createOutputStreamWriter(
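The change from `case _` to `case None` makes the match explicit over both Option cases. The fallback itself is simple; a hedged sketch of the same rule:

```python
def output_charset(encoding=None):
    # A user-specified charset wins; otherwise JSON is written in
    # UTF-8, matching the JsonOutputWriter default above.
    if encoding is not None:
        return encoding
    return "UTF-8"
```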
@@ -2070,9 +2070,9 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
// Read
val data =
s"""
| {"f":
|"a", "f0": 1}$lineSep{"f":
|
| {"f":
|"a", "f0": 1}$lineSep{"f":
|
|"c", "f0": 2}$lineSep{"f": "d", "f0": 3}
""".stripMargin
val dataWithTrailingLineSep = s"$data$lineSep"
@@ -2140,9 +2140,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
.option("encoding", "UTF-16")
.json(testFile(fileName))

checkAnswer(jsonDF, Seq(
Row("Chris", "Baird"), Row("Doug", "Rood")
))
checkAnswer(jsonDF, Seq(Row("Chris", "Baird"), Row("Doug", "Rood")))
}

test("SPARK-23723: multi-line json in UTF-32BE with BOM") {
@@ -2207,10 +2205,9 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
}

def checkEncoding(
expectedEncoding: String,
pathToJsonFiles: String,
expectedContent: String
): Unit = {
expectedEncoding: String,
pathToJsonFiles: String,
expectedContent: String): Unit = {
Member:
I think it should be

def checkEncoding(
    expectedEncoding: String,
    pathToJsonFiles: String,
    expectedContent: String): Unit = {

per https://github.com/databricks/scala-style-guide#spacing-and-indentation

or

def checkEncoding(
    expectedEncoding: String, pathToJsonFiles: String, expectedContent: String): Unit = {

if it fits per databricks/scala-style-guide#58 (comment)

Not a big deal

val jsonFiles = new File(pathToJsonFiles)
.listFiles()
.filter(_.isFile)
@@ -2288,13 +2285,8 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
}
}

def checkReadJson(
lineSep: String,
encodingOption: String,
encoding: String,
inferSchema: Boolean,
runId: Int
): Unit = {
def checkReadJson(lineSep: String, encodingOption: String, encoding: String,
inferSchema: Boolean, runId: Int): Unit = {
test(s"SPARK-23724: checks reading json in ${encoding} #${runId}") {
val lineSepInBytes = {
if (lineSep.startsWith("x")) {
Member:

val delimInBytes = if {
...

Member:

??? Would you mind elaborating on this one?

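The `lineSep.startsWith("x")` branch under review converts a hex-encoded separator such as `x0d x0a` into raw bytes for the test. A hedged Python sketch of that helper's logic, reconstructed from the test code rather than taken from Spark:

```python
def line_sep_in_bytes(line_sep: str, encoding: str) -> bytes:
    if line_sep.startswith("x"):
        # "x0d x0a" -> b"\r\n": each whitespace-separated token is
        # one hex-encoded byte with a leading "x".
        return bytes(int(tok[1:], 16) for tok in line_sep.split())
    # Otherwise the separator is a plain string, encoded in the
    # same charset as the JSON file itself.
    return line_sep.encode(encoding)
```

Encoding the plain-string separator with the file's charset matters: `"|"` is one byte in UTF-8 but two bytes (`7C 00`) in UTF-16LE.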