Closed
17 commits
446ae98
[WIP] filter out empty/whitespace JSON lines when skipping parsing
Jan 21, 2019
1544771
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 21, 2019
236227f
remove println/dumpStack
Jan 21, 2019
e4d9052
add test for non-parsed JSON count
Jan 21, 2019
105e5bb
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 22, 2019
e8e3189
Merge branch 'json_emptyline_count_test' of github.com:sumitsu/spark …
Jan 22, 2019
13942b8
fix scala import style errors
Jan 22, 2019
5f173d9
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 22, 2019
7a51764
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 22, 2019
91305ee
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 23, 2019
051d84a
push down non-parsed json record filter into FailureSafeParser
Jan 23, 2019
57d2c05
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 24, 2019
4fffe7f
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 25, 2019
2252045
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 26, 2019
532a83d
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 27, 2019
3cae4da
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 27, 2019
cd2f30c
Merge branch 'master' of github.com:apache/spark into json_emptyline_…
Jan 27, 2019
remove println/dumpStack
Branden Smith committed Jan 21, 2019
commit 236227fab6d5c9bedf696be8a8a071e5df6a3380
@@ -55,16 +55,9 @@ class FailureSafeParser[IN](

   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      Thread.dumpStack()
       if (skipParsing) {
-        // scalastyle:off println
-        println("!!!! NOT PARSING !!!!")
-        // scalastyle:on println
         Iterator.single(InternalRow.empty)
       } else {
-        // scalastyle:off println
-        println("!!!! PARSING !!!!")
-        // scalastyle:on println
         rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
       }
Member

Do we need to modify this file? Since this is a JSON-specific issue, I think it's better to handle this case on the JSON parser side. Can't we handle this the same way as the CSV one? E.g.,

val filteredLines: Iterator[String] = CSVExprUtils.filterCommentAndEmpty(lines, options)

Author

@sumitsu sumitsu Jan 28, 2019

One of my earlier revisions worked in the way you suggest (if I've understood your point correctly); I changed it in order to avoid redundant empty-line filtering in the case where the full parser has to run anyway (i.e. where skipParsing == false).

What do you think? Is that a valid optimization, or is it better to do it on the JSON side as in 13942b8 to avoid changes to FailureSafeParser?
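
The trade-off between the two placements can be sketched in isolation. This is a minimal illustration, not the PR's code: `fullParse`, `parseWithUpstreamFilter`, `parseWithPushedDownFilter`, and the `"emptyRow"` marker are hypothetical stand-ins, plain strings replace Spark's `InternalRow`, and `trim` is a simplification of the real whitespace check. It assumes the full parser already yields no rows for a blank line.

```scala
// Hypothetical stand-ins only; none of these names are Spark APIs.
object EmptyLineFilterSketch {
  // Stand-in for the full JSON parser. Assumption: a blank line produces no rows.
  def fullParse(line: String): Iterator[String] =
    if (line.trim.isEmpty) Iterator.empty else Iterator.single(s"row($line)")

  // Option A: filter on the data source side, before any parsing decision.
  // Every line pays for the blank check, even when the full parser runs anyway.
  def parseWithUpstreamFilter(lines: Iterator[String], skipParsing: Boolean): Iterator[String] = {
    val nonBlank = lines.filter(_.trim.nonEmpty)
    if (skipParsing) nonBlank.map(_ => "emptyRow") else nonBlank.flatMap(fullParse)
  }

  // Option B: push the filter down so it runs only when parsing is skipped.
  // When skipParsing is false, the parser alone decides which lines yield rows.
  def parseWithPushedDownFilter(lines: Iterator[String], skipParsing: Boolean): Iterator[String] =
    if (skipParsing) lines.filter(_.trim.nonEmpty).map(_ => "emptyRow")
    else lines.flatMap(fullParse)
}
```

Under that assumption both options return the same number of rows for the same input; they differ only in whether the blank check also runs on the parsing path.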

Member

I was thinking of something like this: maropu@f4df907
But you should defer to others who are more familiar with this part, @HyukjinKwon and @MaxGekk.

} catch {
@@ -129,20 +129,13 @@ object TextInputJsonDataSource extends JsonDataSource {
   }

   private def isAllWhitespace(rowText: Text): Boolean = {
-    val afdsafdsa =
     rowText.getLength == 0 || {
       val rowTextBuffer = ByteBuffer.wrap(rowText.getBytes)
-      continually {
-        val cp = Text.bytesToCodePoint(rowTextBuffer)
-        print(s"[${Character.getName(cp)}]")
-        cp
-      } .takeWhile(_ >= 0).take(rowText.getLength).forall(isWhitespace)
+      continually(Text.bytesToCodePoint(rowTextBuffer))
+        .takeWhile(_ >= 0)
+        .take(rowText.getLength)
+        .forall(isWhitespace)
     }
-    // scalastyle:off println
-    println()
-    println(new String(rowText.getBytes))
-    // scalastyle:on println
-    afdsafdsa
   }

   override def readFile(
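
For readers without Hadoop on the classpath, the whitespace check in the hunk above can be approximated with the JDK alone. This is a rough sketch under stated assumptions, not the PR's implementation: `WhitespaceCheckSketch` is a hypothetical name, the input is assumed to be valid UTF-8, and the `isWhitespace` used in the diff is assumed to behave like `java.lang.Character.isWhitespace(int)`. Unlike the original, it decodes the whole buffer into a `String` first, which costs an extra allocation.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper; a JDK-only approximation of the check in the diff.
object WhitespaceCheckSketch {
  // True when the UTF-8 bytes decode to zero or more whitespace code points.
  def isAllWhitespace(bytes: Array[Byte]): Boolean = {
    val text = new String(bytes, UTF_8)
    var i = 0
    while (i < text.length) {
      val cp = text.codePointAt(i)
      if (!Character.isWhitespace(cp)) return false
      // Advance by one or two chars, depending on whether cp is supplementary.
      i += Character.charCount(cp)
    }
    true
  }
}
```

An empty array counts as all-whitespace here, matching the `rowText.getLength == 0` short-circuit in the diff.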