[SPARK-23723] New charset option for json datasource #20849
Closed
Changes from 1 commit
37 commits
- b2e92b4 Test for reading json in UTF-16 with BOM (MaxGekk)
- cb2f27b Use user's charset or autodetect it if the charset is not specified (MaxGekk)
- 0d45fd3 Added a type and a comment for charset (MaxGekk)
- 1fb9b32 Replacing the monadic chaining by matching because it is more readable (MaxGekk)
- c3b04ee Keeping the old method for backward compatibility (MaxGekk)
- 93d3879 testFile is moved into the test to make more local because it is used… (MaxGekk)
- 15798a1 Adding the charset as third parameter to the text method (MaxGekk)
- cc05ce9 Removing whitespaces at the end of the line (MaxGekk)
- 74f2026 Fix the comment in javadoc style (MaxGekk)
- 4856b8e Simplifying of the UTF-16 test (MaxGekk)
- 084f41f A hint to the exception how to set the charset explicitly (MaxGekk)
- 31cd793 Fix for scala style checks (MaxGekk)
- 6eacd18 Run tests again (MaxGekk)
- 3b4a509 Improving of the exception message (MaxGekk)
- cd1124e Appended the original message to the exception (MaxGekk)
- ebf5390 Multi-line reading of json file in utf-32 (MaxGekk)
- c5b6a35 Autodetect charset of jsons in the multiline mode (MaxGekk)
- ef5e6c6 Test for reading a json in UTF-16LE in the multiline mode by using us… (MaxGekk)
- f9b6ad1 Fix test: rename the test file - utf32be -> utf32BE (MaxGekk)
- 3b7714c Fix code style (MaxGekk)
- edb9167 Appending the create verb to the method for readability (MaxGekk)
- 5ba2881 Making the createParser as a separate private method (MaxGekk)
- 1509e10 Fix code style (MaxGekk)
- e3184b3 Checks the charset option is supported (MaxGekk)
- 87d259c Support charset as a parameter of the json method (MaxGekk)
- 76c1d08 Test for charset different from utf-8 (MaxGekk)
- 88395b5 Description of the charset option of the json method (MaxGekk)
- f2f8ae7 Minor changes in comments: added . at the end of a sentence (MaxGekk)
- b451a03 Added a test for wrong charset name (MaxGekk)
- c13c159 Testing that charset in any case is acceptable (MaxGekk)
- 1cb3ac0 Test: user specified wrong (but supported) charset (MaxGekk)
- 108e8e7 Set charset as an option (MaxGekk)
- 0d20cc6 Test: saving to json in UTF-32BE (MaxGekk)
- 54baf9f Taking user's charset for saved json (MaxGekk)
- 1d50d94 Test: output charset is UTF-8 by default (MaxGekk)
- bb53798 Changing the readJsonFiles method for readability (MaxGekk)
- 961b482 The test checks that json written by Spark can be read back (MaxGekk)
Support charset as a parameter of the json method
commit 87d259c7d190716a89016c85b7450d471b3481bf

    @@ -176,7 +176,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
                  allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None,
                  allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
                  mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
    -             multiLine=None, allowUnquotedControlChars=None):
    +             multiLine=None, allowUnquotedControlChars=None, charset=None):
             """
             Loads JSON files and returns the results as a :class:`DataFrame`.

    @@ -237,6 +237,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
                                               characters (ASCII characters with value less than 32,
                                               including tab and line feed characters) or not.
    +        :param charset: standard charset name, for example UTF-8, UTF-16 and UTF-32 If None is
    +                        set, the charset of input json will be detected automatically.

Member
Can we have another test case with an encoding Jackson doesn't automatically detect too?


             >>> df1 = spark.read.json('python/test_support/sql/people.json')
             >>> df1.dtypes
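
To make the docstring's contract and the reviewer's request concrete, here is a minimal plain-Python sketch of "use the caller's charset, otherwise autodetect". It is not Spark or Jackson code: `load_json_bytes` is a hypothetical helper, and the autodetection shown is the one built into Python's `json.loads` for bytes input (UTF-8, UTF-16, UTF-32), which only stands in for the parser's own detection. The `cp1251` case is the kind of encoding that cannot be detected and therefore needs the explicit option.

```python
import json
from typing import Any, Optional


def load_json_bytes(data: bytes, charset: Optional[str] = None) -> Any:
    """Parse one JSON document from raw bytes (hypothetical helper, not Spark code)."""
    if charset is None:
        # json.loads accepts bytes and detects UTF-8, UTF-16 and UTF-32 itself.
        return json.loads(data)
    # An explicit charset bypasses detection; bytes.decode raises LookupError
    # when the charset name is unknown.
    text = data.decode(charset)
    # Drop a leading byte order mark that the codec may have left in place.
    return json.loads(text.lstrip("\ufeff"))


record = {"name": "Привет"}
text = json.dumps(record, ensure_ascii=False)

# Detected automatically: UTF-16 with a BOM.
assert load_json_bytes(text.encode("utf-16")) == record

# Not detectable: a single-byte legacy encoding needs the explicit charset.
legacy = text.encode("cp1251")
assert load_json_bytes(legacy, charset="cp1251") == record
```

Without the explicit charset, the `cp1251` bytes are treated as UTF-8 and decoding fails, which is the situation a test with a non-detectable encoding would exercise.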
Shall we use `encoding` to be consistent with CSV? `charset` had an alias `encoding` to look after Pandas and R.
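
One way to support both names, sketched in plain Python. The helper name `resolve_encoding` and the precedence (`encoding` first, `charset` as fallback) are assumptions made for illustration, not what the PR or the CSV datasource is confirmed to do.

```python
from typing import Mapping, Optional


def resolve_encoding(options: Mapping[str, str]) -> Optional[str]:
    """Return the requested encoding, or None to fall back to auto-detection.

    Hypothetical helper: `encoding` is treated as the primary key and
    `charset` as its alias; option names are matched case-insensitively.
    """
    lowered = {key.lower(): value for key, value in options.items()}
    return lowered.get("encoding", lowered.get("charset"))


# Either name works; `encoding` wins when both are given (assumed precedence).
assert resolve_encoding({"charset": "UTF-16"}) == "UTF-16"
assert resolve_encoding({"encoding": "UTF-8", "charset": "UTF-16"}) == "UTF-8"
assert resolve_encoding({}) is None
```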