[SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files #20937
Closed
105 commits
b2e92b4
Test for reading json in UTF-16 with BOM
MaxGekk cb2f27b
Use user's charset or autodetect it if the charset is not specified
MaxGekk 0d45fd3
Added a type and a comment for charset
MaxGekk 1fb9b32
Replacing the monadic chaining by matching because it is more readable
MaxGekk c3b04ee
Keeping the old method for backward compatibility
MaxGekk 93d3879
testFile is moved into the test to make more local because it is used…
MaxGekk 15798a1
Adding the charset as third parameter to the text method
MaxGekk cc05ce9
Removing whitespaces at the end of the line
MaxGekk 74f2026
Fix the comment in javadoc style
MaxGekk 4856b8e
Simplifying of the UTF-16 test
MaxGekk 084f41f
A hint to the exception how to set the charset explicitly
MaxGekk 31cd793
Fix for scala style checks
MaxGekk 6eacd18
Run tests again
MaxGekk 3b4a509
Improving of the exception message
MaxGekk cd1124e
Appended the original message to the exception
MaxGekk ebf5390
Multi-line reading of json file in utf-32
MaxGekk c5b6a35
Autodetect charset of jsons in the multiline mode
MaxGekk ef5e6c6
Test for reading a json in UTF-16LE in the multiline mode by using us…
MaxGekk f9b6ad1
Fix test: rename the test file - utf32be -> utf32BE
MaxGekk 3b7714c
Fix code style
MaxGekk edb9167
Appending the create verb to the method for readability
MaxGekk 5ba2881
Making the createParser as a separate private method
MaxGekk 1509e10
Fix code style
MaxGekk e3184b3
Checks the charset option is supported
MaxGekk 87d259c
Support charset as a parameter of the json method
MaxGekk 76c1d08
Test for charset different from utf-8
MaxGekk 88395b5
Description of the charset option of the json method
MaxGekk f2f8ae7
Minor changes in comments: added . at the end of a sentence
MaxGekk b451a03
Added a test for wrong charset name
MaxGekk c13c159
Testing that charset in any case is acceptable
MaxGekk 1cb3ac0
Test: user specified wrong (but supported) charset
MaxGekk 108e8e7
Set charset as an option
MaxGekk 0d20cc6
Test: saving to json in UTF-32BE
MaxGekk 54baf9f
Taking user's charset for saved json
MaxGekk 1d50d94
Test: output charset is UTF-8 by default
MaxGekk bb53798
Changing the readJsonFiles method for readability
MaxGekk 961b482
The test checks that json written by Spark can be read back
MaxGekk a794988
Adding the delimiter option encoded in base64
MaxGekk dccdaa2
Separator encoded as a sequence of bytes in hex
MaxGekk d0abab7
Refactoring: removed unused imports and renaming a parameter
MaxGekk 6741796
The sep option is renamed to recordSeparator. The supported format is…
MaxGekk e4faae1
Renaming recordSeparator to recordDelimiter
MaxGekk 01f4ef5
Comments for the recordDelimiter option
MaxGekk 24cedb9
Support other formats of recordDelimiter
MaxGekk d40dda2
Checking different charsets and record delimiters
MaxGekk ad6496c
Renaming test's method to make it more readable
MaxGekk 358863d
Test of reading json in different charsets and delimiters
MaxGekk 7e5be5e
Fix inferring of csv schema for any charsets
MaxGekk d138d2d
Fix errors of scalastyle check
MaxGekk c26ef5d
Reserving format for regular expressions and concatenated json
MaxGekk 5f0b069
Fix recordDelimiter tests
MaxGekk ef8248f
Additional cases are added to the delimiter test
MaxGekk 2efac08
Renaming recordDelimiter to lineSeparator
MaxGekk b2020fa
Adding HyukjinKwon changes
MaxGekk f99c1e1
Revert lineSepInWrite back to lineSep
MaxGekk 6d13d00
Merge remote-tracking branch 'origin/master' into json-line-sep
MaxGekk 77112ef
Fix passing of the lineSeparator to HadoopFileLinesReader
MaxGekk d632706
Fix python style checking
MaxGekk bbff402
Fix text source tests and javadoc comments
MaxGekk 3af996b
Merge branch 'json-charset' into json-charset-record-delimiter
MaxGekk 8253811
Merge branch 'json-line-sep' into json-charset-record-delimiter
MaxGekk ab8210c
Getting UTF-8 as default charset for lineSep
MaxGekk 7c6f115
Set charset different from UTF-8 in the test
MaxGekk f553b07
Fix for the charset test: charset wasn't specified
MaxGekk d6a07a1
Removing line leaved after merge
MaxGekk cb12ea3
Removing flexible format for lineSep
MaxGekk eb2965b
Adding ticket number to test titles
MaxGekk 7a4865c
Making comments more precise
MaxGekk dbeb0c1
lineSep must be specified if charset is different from UTF-8
MaxGekk ac67020
Support encoding as a synonym for the charset option
MaxGekk d96b720
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk 75f7bb6
Fix missing require and specifying field of internal row explicitly
MaxGekk d93dcdc
Making the doc generator happy
MaxGekk 65b4b73
Making the encoding name as the primary name
MaxGekk 6b52419
Blacklisting UTF-16 and UTF-32 in per-line mode
MaxGekk 6116bac
Changes after code review
MaxGekk 5383400
Renaming charset to encoding
MaxGekk 1aeae3c
Changes requested by HyukjinKwon in the review
MaxGekk 7e20891
Adding tests for SPARK-23094
MaxGekk 0d3ed3c
Fix comments
MaxGekk 5d5c295
Matching by encoding per each line is eliminated
MaxGekk e7be77d
Addressing Hyukjin's review comments
MaxGekk 6bd841a
Fixes regarding to coding style
MaxGekk 6a62679
Making lineSep as opt string
MaxGekk 3b30ce0
Removing option name in a test
MaxGekk fcd0a21
Merge branch 'master' into json-encoding-line-sep
MaxGekk af71324
Addressing HyukjinKwon's review comments
MaxGekk 76dbbed
Merge branch 'json-encoding-line-sep' of github.com:MaxGekk/spark-1 i…
MaxGekk 3207e59
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk b817184
Making Scala style checker and compiler happy
MaxGekk 15df9af
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk 36253f4
Adressing Hyukjin Kwon's review comments
MaxGekk aa69559
Adding benchmarks for json reads
MaxGekk c35d5d1
Making Scala style checker happy
MaxGekk 58fc5c6
Eliminate unneeded wrapping by ByteArrayInputStream per-line during s…
MaxGekk 63b5894
Adding benchmarks for wide lines
MaxGekk 1ace082
Making comments shorter
MaxGekk 6c0df03
Removing empty line between spark's imports
MaxGekk b4c0d38
Creating a stream decoder with specific buffer size
MaxGekk f2a259f
Enable all JSON benchmarks
MaxGekk 482b799
Addressing Hyukjin Kwon's review comments
MaxGekk a0ab98b
Addressing Wenchen Fan's review comments
MaxGekk a7be182
Merge branch 'json-encoding-line-sep' of github.com:MaxGekk/spark-1 i…
MaxGekk e0cebf4
Merge remote-tracking branch 'origin/master' into json-encoding-line-sep
MaxGekk d3d28aa
Addressing Hyukjin Kwon's review comments
MaxGekk
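The commits "lineSep must be specified if charset is different from UTF-8" and "Blacklisting UTF-16 and UTF-32 in per-line mode" touch on a byte-level detail that a small example may make clearer. The sketch below is not Spark code and uses only the Python standard library; the helper `parse_json_lines` and the sample records are hypothetical, made up for illustration. It assumes the record separator has to be encoded in the same charset as the data before the byte stream is split.

```python
import json


def parse_json_lines(data: bytes, encoding: str, line_sep: str = "\n") -> list:
    """Split `data` on `line_sep` encoded in `encoding`, then parse each record.

    Hypothetical helper for illustration only; not part of Spark.
    """
    # Use a BOM-less codec name such as "utf-16-le"; Python's plain "utf-16"
    # would prepend a byte-order mark to the encoded separator.
    sep = line_sep.encode(encoding)
    # Simplification: bytes.split does not check code-unit alignment, which a
    # real reader would have to handle.
    return [json.loads(chunk.decode(encoding)) for chunk in data.split(sep) if chunk]


# Made-up sample data: two JSON records separated by a newline, in UTF-16LE.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
data = "\n".join(json.dumps(r) for r in records).encode("utf-16-le")

# Splitting on the single byte 0x0A leaves the separator's trailing 0x00 byte
# at the start of the next chunk, so that chunk has an odd number of bytes and
# is no longer valid UTF-16LE.
naive_chunks = data.split(b"\n")

# Splitting on the separator encoded in the data's own charset keeps every
# record intact.
parsed = parse_json_lines(data, "utf-16-le")
```

In UTF-16LE a newline is the two bytes `0A 00`, so a reader that looks only for `0A` cuts each record in the middle of a code unit. That is one plausible reading of why the per-line mode needs an explicit separator for non-UTF-8 input.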
Support other formats of recordDelimiter
commit 24cedb9d809b026fa36b01fb2b425918b43857df
ditto