[SPARK-13792][SQL] Limit logging of bad records in CSV data source #13795
cc @maropu
}
}

def parseCsv(
this was dead code
@rxin okay, lgtm.
Test build #60896 has finished for PR 13795 at commit
Merging in master/2.0.
## What changes were proposed in this pull request?

This pull request adds a new option (maxMalformedLogPerPartition) to the CSV reader that limits the number of log messages Spark generates per partition for malformed records.

The error log looks something like

```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```

Closes #12173

## How was this patch tested?

Manually tested.

Author: Reynold Xin <[email protected]>

Closes #13795 from rxin/SPARK-13792.

(cherry picked from commit c775bf0)
Signed-off-by: Reynold Xin <[email protected]>
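For readers who want to see the idea in isolation, here is a minimal Python sketch of per-partition log capping. It is not the code from this PR: the function name `drop_malformed`, the `expected_fields` check, and the naive comma split are all assumptions made for illustration. Only the option name, the default of 10, and the two log messages come from the description above.

```python
import logging
from typing import Iterable, Iterator, List

logger = logging.getLogger("CSVRelation")


def drop_malformed(
    lines: Iterable[str],
    expected_fields: int,
    max_malformed_log_per_partition: int = 10,
) -> Iterator[List[str]]:
    """Yield well-formed rows from one partition, logging a capped number of bad ones.

    Hypothetical helper: the counter is local to this call, so each
    partition gets its own logging budget.
    """
    malformed_seen = 0
    for line in lines:
        # Naive split for illustration; a real CSV parser also handles quoting.
        fields = line.split(",")
        if len(fields) == expected_fields:
            yield fields
            continue

        malformed_seen += 1
        if malformed_seen <= max_malformed_log_per_partition:
            logger.warning("Dropping malformed line: %s", line)
        elif malformed_seen == max_malformed_log_per_partition + 1:
            # Emit the summary once, then stay silent for the rest of the partition.
            logger.warning(
                "More than %d malformed records have been found on this partition. "
                "Malformed records from now on will not be logged.",
                max_malformed_log_per_partition,
            )
```

Malformed rows are still dropped after the cap is reached; only the logging stops. In this sketch the summary line is written when the first record beyond the cap is seen, which may differ from the exact point at which Spark writes it.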
log for each partition. Malformed records beyond this
number will be ignored. If None is set, it
uses the default value, ``10``.
:param mode: allows a mode for dealing with corrupt records during parsing. If None is
doesn't this maxMalformedLogPerPartition need to be added to L412, self._set_csv_opts?
Actually this is right!
oh, my bad...
## What changes were proposed in this pull request?

This is a follow-up to #13795 to properly set CSV options in the Python API. As part of this, I also make the Python option setting for both CSV and JSON more robust against positional errors.

## How was this patch tested?

N/A

Author: Reynold Xin <[email protected]>

Closes #13800 from rxin/SPARK-13792-2.

(cherry picked from commit 9333880)
Signed-off-by: Reynold Xin <[email protected]>
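One way to read "more robust against positional errors" is that options are forwarded by keyword rather than by position, so adding a parameter in the middle of a signature cannot shift later values into the wrong option. The sketch below illustrates that pattern only. `ReaderSketch` and `_set_opts` are hypothetical names, not the PySpark classes or helpers touched by the follow-up.

```python
from typing import Any, Dict, Optional


class ReaderSketch:
    """Hypothetical stand-in for a reader that collects string-valued options."""

    def __init__(self) -> None:
        self.options: Dict[str, str] = {}

    def option(self, key: str, value: Any) -> "ReaderSketch":
        self.options[key] = str(value)
        return self

    def _set_opts(self, **options: Any) -> None:
        # Each value travels with its name, so argument order cannot mix them up.
        for key, value in options.items():
            if value is not None:  # None means "use the reader's default"
                self.option(key, value)

    def csv(
        self,
        path: str,
        sep: Optional[str] = None,
        mode: Optional[str] = None,
        maxMalformedLogPerPartition: Optional[int] = None,
    ) -> "ReaderSketch":
        # Keyword forwarding: a new parameter only needs one more name=value pair.
        self._set_opts(
            sep=sep,
            mode=mode,
            maxMalformedLogPerPartition=maxMalformedLogPerPartition,
        )
        return self
```

With positional forwarding, a call site that omits or reorders one argument silently assigns values to the wrong options; with keywords, the mistake either cannot happen or fails loudly with a `TypeError`.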