Fix extra quotation marks issue #582

Hanyu-Liu-123 · 2021-11-17T14:51:41Z

What does this PR do?

Summary

When augmenting from CSV files, we use the CSV sniffer to infer the CSV format automatically. This works most of the time, but it fails when a line contains more than two double quotation marks, because the CSV sniffer treats double quotation marks as escape characters.

The workaround reads the input CSV file, replaces double quotation marks with single quotation marks, and writes the modified lines to a temporary CSV file. The temporary file is needed because we could not pass the modified input (a list of strings) directly to the CSV sniffer, so the sniffer reads the temporary file instead.

Users are warned about which lines of the original input file were modified.
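The preprocessing step described above could look roughly like the sketch below. It is not the code from this PR: the function names `replace_double_quotes` and `write_cleaned_copy` are hypothetical, and the sketch stops at producing the temporary file, leaving the sniffing step to the caller.

```python
import tempfile
import warnings


def replace_double_quotes(lines):
    """Return (cleaned_lines, modified_line_numbers); line numbers are 1-based."""
    cleaned, modified = [], []
    for number, line in enumerate(lines, start=1):
        new_line = line.replace('"', "'")
        if new_line != line:
            modified.append(number)
        cleaned.append(new_line)
    return cleaned, modified


def write_cleaned_copy(input_path):
    """Write a cleaned copy of input_path to a temp file; the original is untouched."""
    with open(input_path, newline="", encoding="utf-8") as src:
        cleaned, modified = replace_double_quotes(src.readlines())
    if modified:
        # Tell the user which lines of the original file were changed.
        warnings.warn(f"Replaced double quotes on lines {modified} of {input_path}")
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
    )
    with tmp:
        tmp.writelines(cleaned)
    # The caller is responsible for deleting the temporary file.
    return tmp.name, modified
```

The contents of the returned temporary file can then be read and passed to `csv.Sniffer().sniff`, and the file removed once augmentation is done.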

Additions

csv_reader_workaround()

Hanyu-Liu-123 · 2022-02-13T20:04:41Z

@srujanjoshi @donggrant The issue I was trying to solve with this PR is that unwanted quotation marks are added to the output CSV file when a sentence contains double quotation marks. The problem is twofold.

First, suppose we have the CSV file below:

text, label
Today is a "good" day, 1

it will return something like:

text, label
Today is a ""fantastic"" day, 1

This happens because the double quotation mark is the escape character in CSV files, so each double quotation mark inside a field is doubled when written to the CSV file.
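The doubling can be reproduced with Python's standard `csv.writer` in its default dialect. This is a minimal sketch, assuming the output is written with the default settings; the helper name `to_csv_line` is hypothetical. Note that the default writer also wraps the whole field in double quotation marks.

```python
import csv
import io


def to_csv_line(row):
    """Serialize one row with the default csv dialect and return the raw text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue()


line = to_csv_line(['Today is a "good" day', 1])
print(line)  # "Today is a ""good"" day",1

# Reading the line back with csv.reader restores the original text.
print(next(csv.reader([line])))
```

The doubled quotation marks are therefore valid CSV escaping: a CSV-aware reader recovers the original sentence, and the extra characters only show up when the file is viewed as plain text.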

The second problem occurs when the sentence in the input file is itself wrapped in double quotation marks, as in the following:

text, label
"Today is a "good" day", 1

it will return something like:

text, label
"Today is a fantastic"" day""", 1

This is due to how the CSV sniffer handles the input file: it has a minor glitch when double quotation marks appear inside a sentence that is already wrapped in double quotation marks. When it reads "Today is a "good" day" from the input file, it parses it as Today is a good" day" (the first two quotation marks are removed and the rest remain). After augmentation, the remaining quotation marks are escaped with extra quotation marks, so the output is "Today is a fantastic"" day""".
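The same parse can be observed with Python's `csv.reader` in its default, non-strict dialect. This sketch assumes the default dialect and leaves out the space after the comma; whether the project's sniffer-derived dialect behaves identically is not confirmed here.

```python
import csv

raw = '"Today is a "good" day",1'

# Default (lenient) parsing: the opening quote and the first inner quote are
# consumed, and the remaining quotes are kept as literal characters.
row = next(csv.reader([raw]))
print(row[0])  # Today is a good" day"

# With strict=True the malformed field is rejected instead of being guessed at.
try:
    next(csv.reader([raw], strict=True))
    strict_failed = False
except csv.Error:
    strict_failed = True
print(strict_failed)
```

Strict parsing does not repair the input, but it surfaces malformed rows early instead of silently altering them.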

I wasn't able to find a way to fix the CSV sniffer, so I thought we could preprocess the input file by replacing the double quotation marks with single quotation marks. This turned out to be a messy solution, because we need to create a temp file to store the processed input for the CSV sniffer; overwriting the original input file may be risky.

donggrant · 2022-02-20T20:26:36Z

Some more cases with the bug:

| Input | Output |
| --- | --- |
| Today is a good day | Today is a good day |
| Today is a "good" day | "Today is a ""good"" day" |
| “Today is a good day” | Today is a good day |
| ""Today is a "good" day"" | "Today is a ""good"" day""""" |
| "Today is a "good" day" | "Today is a good"" day""" |
| Today is a "good" day" | "Today is a ""good"" day""" |
| Today "is a "good" day" | "Today ""is a ""good"" day""" |
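Most of these cases can be reproduced by reading a line with `csv.reader` and writing it straight back with `csv.writer`, with no augmentation in between. This is a sketch under the assumption that both use Python's default dialect; the helper `round_trip` is hypothetical, and the case shown with curly quotation marks is left out because its exact input characters are unclear.

```python
import csv
import io


def round_trip(line):
    """Parse one single-column CSV line, then write it back with default settings."""
    row = next(csv.reader([line]))
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().rstrip("\r\n")


cases = [
    'Today is a good day',
    'Today is a "good" day',
    '""Today is a "good" day""',
    '"Today is a "good" day"',
    'Today is a "good" day"',
    'Today "is a "good" day"',
]
for case in cases:
    print(case, "->", round_trip(case))
```

Because the outputs match without any augmentation step, the extra quotation marks appear to come from the read and write steps rather than from the augmenter itself.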

qiyanjun · 2022-03-13T19:50:32Z

see #623 for a newer solution

add csv reader workaround

f01f65b

Hanyu-Liu-123 linked an issue Nov 17, 2021 that may be closed by this pull request

When augmenting, extra quotation marks are added automatically #431

Closed

Hanyu-Liu-123 self-assigned this Nov 17, 2021

srujanjoshi self-assigned this Feb 13, 2022

qiyanjun closed this Mar 13, 2022

jxmorris12 deleted the fix-extra-quotation-marks- branch June 8, 2022 21:04
