@maropu
Member

maropu commented Apr 5, 2016

What changes were proposed in this pull request?

Currently, in PERMISSIVE and DROPMALFORMED modes, we log every record that is going to be ignored. With large datasets this can generate a very large volume of logs. This PR instead logs only some of the malformed records, together with the number of subsequent malformed records, for each partition.
It adds two options, as follows:

sqlContext.read
  .format("csv")
  .option("mode", "COUNTMALFORMED")
  .option("maxStoredMalformedPerPartition", 3)
  .load("test.csv").show

An example log message:

16/04/05 16:42:12 WARN CSVRelation: # of total malformed lines: 25
3 malformed lines extracted and listed as follows;
ab ccc ddd ddd
ab ccc ddd ddd
...

How was this patch tested?

Manual tests done

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54969 has finished for PR 12173 at commit f458080.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Apr 5, 2016

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54970 has finished for PR 12173 at commit f458080.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Apr 5, 2016

@falaki Could you check whether this matches your intent?

@falaki
Contributor

falaki commented Apr 5, 2016

I don't think we need a separate parsing mode for this feature. This feature can be part of all existing modes.

@maropu
Member Author

maropu commented Apr 5, 2016

Okay, fixed. Could you check again? cc: @falaki

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55054 has finished for PR 12173 at commit 53a95ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55051 has finished for PR 12173 at commit 2d6be0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This can be private[csv]

@SparkQA

SparkQA commented Apr 12, 2016

Test build #55567 has finished for PR 12173 at commit 01795de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2016

Test build #55569 has finished for PR 12173 at commit bd68106.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2016

Test build #55582 has finished for PR 12173 at commit b8dd628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2016

Test build #55598 has finished for PR 12173 at commit a3809ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Apr 12, 2016

okay, fixed cc: @falaki

@maropu
Member Author

maropu commented Apr 14, 2016

@falaki ping

@maropu
Member Author

maropu commented Apr 22, 2016

@falaki ping

Contributor

Do we still want to log this?

Member Author

This output seems noisy to me, so I'll remove it.

@falaki
Contributor

falaki commented Apr 30, 2016

@maropu Sorry for the delayed response. I seem to have missed the GitHub notification for this.

@maropu
Member Author

maropu commented May 1, 2016

@falaki Okay, I'll fix it within a day. Thanks!

@SparkQA

SparkQA commented May 2, 2016

Test build #57497 has finished for PR 12173 at commit 4f76109.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 2, 2016

Test build #57499 has finished for PR 12173 at commit 47cccd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented May 2, 2016

@falaki okay, fixed.

@SparkQA

SparkQA commented May 10, 2016

Test build #58193 has finished for PR 12173 at commit f2e5f3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented May 10, 2016

@falaki ping

@maropu
Member Author

maropu commented May 13, 2016

Jenkins, retest this please.

@SparkQA

SparkQA commented May 13, 2016

Test build #58551 has finished for PR 12173 at commit 0060d6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jun 2, 2016

@falaki ping

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59827 has finished for PR 12173 at commit d517447.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59829 has finished for PR 12173 at commit b242a62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jun 4, 2016

@falaki ping

@maropu
Member Author

maropu commented Jun 11, 2016

@rxin @falaki ping

@rxin
Contributor

rxin commented Jun 20, 2016

@maropu Thanks for working on this. One thing I don't get is why we are keeping the malformed lines in memory. Can't we just track a counter and stop logging once it exceeds a threshold?
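
The counter-based idea suggested above can be illustrated with a minimal sketch. It assumes a hypothetical `MalformedLineLogger` class (not part of Spark), one instance per partition, and a caller-supplied `log` function. Only a counter is kept, and no malformed line is stored.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper for illustration; this is not Spark's actual implementation.
// Assumption: one instance is created per partition.
class MalformedLineLogger(maxLogged: Int, log: String => Unit) {
  // Only a counter is kept; malformed lines themselves are never stored.
  private var malformedCount = 0L

  def record(line: String): Unit = {
    malformedCount += 1
    if (malformedCount <= maxLogged) {
      // Still under the threshold: log the line as soon as it is found.
      log(s"Dropping malformed line: $line")
    } else if (malformedCount == maxLogged + 1L) {
      // First line past the threshold: emit a single notice, then stay silent.
      log(s"More than $maxLogged malformed lines found; later ones will not be logged.")
    }
  }

  def count: Long = malformedCount
}

object MalformedLineLoggerDemo {
  // Collects messages in a buffer so the behaviour can be inspected without a real logger.
  def run(): Seq[String] = {
    val messages = ArrayBuffer.empty[String]
    val logger = new MalformedLineLogger(2, msg => { messages += msg; () })
    Seq("a,b", "c", "d,e,f", "g").foreach(logger.record)
    messages.toList
  }
}
```

With a threshold of 2 and four malformed lines, the demo logs the first two lines and then a single notice. The counter keeps increasing, so the total is still available through `count`.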

@maropu
Member Author

maropu commented Jun 20, 2016

@rxin Yes, the current implementation holds only malformedLineNum malformed lines in memory: https://github.com/apache/spark/pull/12173/files#diff-18b09be18156e81f965df293a2781aefR31
Did I misunderstand anything?

*/
private[csv] class MalformedLinesInfo(maxStoreMalformed: Int) extends Serializable {

var malformedLines = new ArrayBuffer[String]
Contributor

do we need this?

Member Author

Logging some of the malformed lines seems useful for helping users understand why a line is malformed. However, I'm fine with removing this.

Contributor

We can just log them as we find them, can't we? There's no need to store them in memory.

Member Author

oh, I see. okay, I'll fix this.

}

// Appends partition values
val fullOutput = requiredSchema.toAttributes ++ partitionSchema.toAttributes
Contributor

is this change related to the issue you are fixing?

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60892 has finished for PR 12173 at commit fb6473f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in c775bf0 Jun 21, 2016
asfgit pushed a commit that referenced this pull request Jun 21, 2016
## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) to the CSV reader that limits the number of log messages Spark generates per partition for malformed records.

The error log looks something like
```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```

Closes #12173

## How was this patch tested?
Manually tested.

Author: Reynold Xin

Closes #13795 from rxin/SPARK-13792.

(cherry picked from commit c775bf0)
Signed-off-by: Reynold Xin
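
As a rough illustration of how the option described in the commit message might be passed to the CSV reader, here is a small sketch. The helper `MalformedLogOptions` is hypothetical. Only the option names `mode` and `maxMalformedLogPerPartition` and the `DROPMALFORMED` mode come from this thread. The sketch assumes the limit is read from the same reader options as `mode`.

```scala
// Hypothetical helper for illustration; only the option keys and the
// DROPMALFORMED mode name are taken from this pull request.
object MalformedLogOptions {
  // Builds the reader options as plain strings so they can be passed in one call.
  def build(maxLoggedPerPartition: Int): Map[String, String] =
    Map(
      "mode" -> "DROPMALFORMED",
      "maxMalformedLogPerPartition" -> maxLoggedPerPartition.toString
    )
}

// Intended use, assuming a SparkSession named `spark` and a file "test.csv":
//   spark.read.format("csv").options(MalformedLogOptions.build(10)).load("test.csv").show()
```

Keeping the options in a plain `Map` means the helper itself needs no Spark dependency. The commented line shows where the map would be handed to the reader.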
@maropu maropu deleted the SPARK-13792 branch July 5, 2017 11:47