rxin commented Jun 21, 2016

What changes were proposed in this pull request?

This pull request adds a new option (maxMalformedLogPerPartition) to the CSV reader that limits the number of log messages Spark generates per partition for malformed records.

The error log looks something like

16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
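The capping behavior shown in the log above can be illustrated with a minimal, Spark-free sketch. This is not the code in this patch: the function `drop_malformed`, its `expected_fields` and `warn` parameters, and the "wrong number of comma-separated fields" test for a malformed line are all assumptions made for illustration. Only the message texts and the default of 10 come from the description.

```python
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger("CSVRelationSketch")  # hypothetical logger name


def drop_malformed(
    lines: Iterable[str],
    expected_fields: int,
    max_malformed_log: int = 10,
    warn: Callable[[str], None] = logger.warning,
) -> Iterator[str]:
    """Yield well-formed lines of one partition; log at most
    max_malformed_log dropped lines, then one final notice."""
    malformed_seen = 0  # fresh counter per call, i.e. per partition
    for line in lines:
        # Assumption: a line is malformed when its field count is wrong.
        if len(line.split(",")) == expected_fields:
            yield line
            continue
        malformed_seen += 1
        if malformed_seen <= max_malformed_log:
            warn(f"Dropping malformed line: {line}")
            if malformed_seen == max_malformed_log:
                # Emitted once, right after the last logged record.
                warn(
                    f"More than {max_malformed_log} malformed records have "
                    "been found on this partition. Malformed records from "
                    "now on will not be logged."
                )


# Usage: one good line followed by 25 malformed ones.
messages = []
partition = ["a,1,4,x"] + ["adsf,1,4"] * 25
kept = list(drop_malformed(partition, expected_fields=4, warn=messages.append))
```

In this sketch the malformed lines are still dropped after the cap is reached; only the logging stops, so `messages` holds ten "Dropping" entries plus the final notice.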

Closes #12173

How was this patch tested?

Manually tested.

rxin commented Jun 21, 2016

cc @maropu

}
}

def parseCsv(
Contributor Author

this was dead code

maropu commented Jun 21, 2016

@rxin okay, lgtm.

SparkQA commented Jun 21, 2016

Test build #60896 has finished for PR 13795 at commit 05b130a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rxin commented Jun 21, 2016

Merging in master/2.0.

@asfgit asfgit closed this in c775bf0 Jun 21, 2016
asfgit pushed a commit that referenced this pull request Jun 21, 2016

Author: Reynold Xin <[email protected]>

Closes #13795 from rxin/SPARK-13792.

(cherry picked from commit c775bf0)
Signed-off-by: Reynold Xin <[email protected]>
log for each partition. Malformed records beyond this
number will be ignored. If None is set, it
uses the default value, ``10``.
:param mode: allows a mode for dealing with corrupt records during parsing. If None is
Member

doesn't this maxMalformedLogPerPartition need to be added to L412, self._set_csv_opts?

Contributor Author

Actually this is right!

Member

oh, my bad...

asfgit pushed a commit that referenced this pull request Jun 21, 2016
## What changes were proposed in this pull request?
This is a follow-up to #13795 to set CSV options properly in the Python API. As part of this, I also made the Python option setting for both CSV and JSON more robust against positional errors.
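One common way to make option setting robust against positional errors is to pass every option by keyword, so that adding or reordering a parameter cannot shift a value onto the wrong option. The sketch below only illustrates that idea and is not the code from this commit: `ReaderSketch` and `_set_opts` are hypothetical names, and only the option name `maxMalformedLogPerPartition` comes from this thread.

```python
from typing import Any, Dict, Optional


class ReaderSketch:
    """Hypothetical stand-in for a reader that collects string options."""

    def __init__(self) -> None:
        self.options: Dict[str, str] = {}

    def _set_opts(self, **opts: Optional[Any]) -> None:
        # Keyword-only by construction: each value travels with its name.
        for name, value in opts.items():
            if value is not None:  # None means "use the default"
                self.options[name] = str(value)

    def csv(
        self,
        path: str,
        sep: Optional[str] = None,
        mode: Optional[str] = None,
        maxMalformedLogPerPartition: Optional[int] = None,
    ) -> Dict[str, str]:
        # Every option is forwarded by name, never by position.
        self._set_opts(
            sep=sep,
            mode=mode,
            maxMalformedLogPerPartition=maxMalformedLogPerPartition,
        )
        return dict(self.options)


# Usage: only the explicitly set option is recorded.
opts = ReaderSketch().csv("data.csv", maxMalformedLogPerPartition=5)
```

With a positional helper, a newly added parameter that is forgotten at the call site can silently misalign every argument after it; with keywords, a missing option is simply left unset.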

## How was this patch tested?
N/A

Author: Reynold Xin <[email protected]>

Closes #13800 from rxin/SPARK-13792-2.

(cherry picked from commit 9333880)
Signed-off-by: Reynold Xin <[email protected]>
ghost pushed a commit to dbtsai/spark that referenced this pull request Jun 21, 2016
Closes apache#13800 from rxin/SPARK-13792-2.