-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-13792][SQL] Limit logging of bad records in CSVRelation #12173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #54969 has finished for PR 12173 at commit
|
|
Jenkins, retest this please. |
|
Test build #54970 has finished for PR 12173 at commit
|
|
@falaki Could you check this to match your intent? |
|
I don't think we need a separate parsing mode for this feature. This feature can be part of all existing modes. |
|
okay and fixed. Please re-check again? cc: @falaki |
|
Test build #55054 has finished for PR 12173 at commit
|
|
Test build #55051 has finished for PR 12173 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be private[csv]
|
Test build #55567 has finished for PR 12173 at commit
|
|
Test build #55569 has finished for PR 12173 at commit
|
|
Test build #55582 has finished for PR 12173 at commit
|
|
Test build #55598 has finished for PR 12173 at commit
|
|
okay, fixed cc: @falaki |
|
@falaki ping |
1 similar comment
|
@falaki ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still want to log this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This output seems to be noisy to me, so I'll remove this.
|
@maropu sorry for delayed response. I seem to have missed the github notification on this. |
|
@falaki okay, I'll fix in a day. thanks! |
|
Test build #57497 has finished for PR 12173 at commit
|
|
Test build #57499 has finished for PR 12173 at commit
|
|
@falaki okay, fixed. |
|
Test build #58193 has finished for PR 12173 at commit
|
|
@falaki ping |
|
Jenkins, retest this please. |
|
Test build #58551 has finished for PR 12173 at commit
|
|
@falaki ping |
|
Test build #59827 has finished for PR 12173 at commit
|
|
Test build #59829 has finished for PR 12173 at commit
|
|
@falaki ping |
|
@maropu thanks for working on this. One thing I don't get is that why we are keeping the malformed lines in memory? Can't we just track a counter and stop logging once the number is greater than a threshold? |
|
@rxin yea, the current implementation only holds |
| */ | ||
| private[csv] class MalformedLinesInfo(maxStoreMalformed: Int) extends Serializable { | ||
|
|
||
| var malformedLines = new ArrayBuffer[String] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The some logs of malformed lines seem to be useful for users to understand why this is a malformed one. However, removing this is okay to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just log them as we find them, can't we? No need to store them in memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see. okay, I'll fix this.
| } | ||
|
|
||
| // Appends partition values | ||
| val fullOutput = requiredSchema.toAttributes ++ partitionSchema.toAttributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change related to the issue you are fixing?
|
Test build #60892 has finished for PR 12173 at commit
|
## What changes were proposed in this pull request? This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records. The error log looks something like ``` 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged. ``` Closes #12173 ## How was this patch tested? Manually tested. Author: Reynold Xin <[email protected]> Closes #13795 from rxin/SPARK-13792. (cherry picked from commit c775bf0) Signed-off-by: Reynold Xin <[email protected]>
What changes were proposed in this pull request?
Currently in
PERMISSIVEandDROPMALFORMEDmodes we log any record that is going to be ignored. This can generate a lot of logs with large datasets. This pr is to log the parts of malformed records and the number of subsequent records for each partition.This adds two options as follows;
A logging message is;
How was this patch tested?
Manual tests done