@Hanyu-Liu-123
Collaborator

What does this PR do?

Summary

When augmenting from CSV files, we use the CSV sniffer to automatically infer the CSV format. This works most of the time but fails when a line contains more than two double quotation marks, because the CSV sniffer treats double quotation marks as escape characters.

The workaround first goes through the input CSV file, replaces double quotation marks with single quotation marks, and then writes the modified lines to a temporary CSV file. The temp file is needed because we couldn't pass the modified input (a list of strings) directly to the CSV sniffer, so the modified lines are saved to a CSV file that the sniffer then reads.

Users are warned about which lines of the original input file were modified.
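
The preprocessing step described above could look roughly like the sketch below. This is not the PR's `csv_reader_workaround()`; the function name `replace_double_quotes`, its return value, and the warning text are assumptions made for illustration.

```python
import tempfile
import warnings


def replace_double_quotes(csv_path):
    """Copy csv_path to a temp file, replacing every '"' with "'".

    Hypothetical helper (not the PR's csv_reader_workaround()).
    Returns (temp_path, modified), where modified lists the 1-based
    numbers of the input lines that were changed.
    """
    modified = []
    with open(csv_path, newline="", encoding="utf-8") as src, \
            tempfile.NamedTemporaryFile(
                "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
            ) as tmp:
        for line_number, line in enumerate(src, start=1):
            if '"' in line:
                modified.append(line_number)
                line = line.replace('"', "'")
            tmp.write(line)
    if modified:
        # Tell the user which lines of the original file were altered.
        warnings.warn(f"Replaced double quotes on lines: {modified}")
    return tmp.name, modified
```

The original file is never touched; the caller reads the returned temp path instead and is responsible for deleting it afterwards.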

Additions

  • csv_reader_workaround()

@Hanyu-Liu-123 Hanyu-Liu-123 linked an issue Nov 17, 2021 that may be closed by this pull request
@Hanyu-Liu-123 Hanyu-Liu-123 self-assigned this Nov 17, 2021
@srujanjoshi srujanjoshi self-assigned this Feb 13, 2022
@Hanyu-Liu-123
Collaborator Author

Hanyu-Liu-123 commented Feb 13, 2022

@srujanjoshi @donggrant The issue I was trying to solve with this PR is that unwanted quotation marks are added to the output CSV file when a sentence contains double quotation marks. The problem has two parts.

First, suppose we have the CSV file below:

text, label
Today is a "good" day, 1

it will return something like:

text, label
Today is a ""fantastic"" day, 1

Because the double quotation mark is the escape character in CSV files, each double quotation mark inside a field has to be doubled when it is written to the CSV file.
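
The doubling can be reproduced with Python's standard `csv.writer`. This is a minimal sketch that assumes the default "excel" dialect; the dialect actually used in the augmentation code path may differ, and `write_row` is a helper name introduced here.

```python
import csv
import io


def write_row(row):
    """Serialize one row with csv.writer's default ("excel") dialect."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue()


# The field contains the quote character, so the writer wraps the field
# in quotes and doubles every embedded quotation mark.
print(write_row(['Today is a "fantastic" day', "1"]))
# prints: "Today is a ""fantastic"" day",1
```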

The second problem occurs when the sentence in the input file is wrapped in double quotation marks, as in the following:

text, label
"Today is a "good" day", 1

it will return something like:

text, label
"Today is a fantastic"" day""", 1

This is due to how the CSV sniffer handles the input file: it glitches when double quotation marks appear inside a sentence that is already wrapped in double quotation marks. When it reads "Today is a "good" day" from the input file, it parses it as Today is a good" day" (the first two quotation marks are removed and the rest remain). After the augmentation, the remaining quotation marks are escaped and the field is quoted again, so the output is "Today is a fantastic"" day""".

I wasn't able to find a way to fix the CSV sniffer, so I thought we could preprocess the input file by replacing the double quotation marks with single quotation marks. But this turned out to be a messy solution: because overwriting the original input file may be risky, we need to create a temp file to store the processed input for the CSV sniffer.
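
The read-then-write behavior described above can be reproduced with the standard `csv` module alone. The sketch below assumes the default "excel" dialect rather than whatever dialect the sniffer infers, drops the space after the comma for simplicity, and stands in for the augmentation step with a plain string replacement; `parse_line` and `format_row` are names introduced here.

```python
import csv
import io


def parse_line(line):
    """Parse one CSV line with the default ("excel") dialect."""
    return next(csv.reader([line]))


def format_row(row):
    """Serialize one row with the default dialect, minus the line ending."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().rstrip("\r\n")


fields = parse_line('"Today is a "good" day",1')
print(fields)
# prints: ['Today is a good" day"', '1']

# Stand-in for the augmenter swapping one word.
augmented = [fields[0].replace("good", "fantastic"), fields[1]]
print(format_row(augmented))
# prints: "Today is a fantastic"" day""",1
```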

@donggrant

Some more cases with the bug:

Input | Output
Today is a good day | Today is a good day
Today is a "good" day | "Today is a ""good"" day"
“Today is a good day” | Today is a good day
""Today is a "good" day"" | "Today is a ""good"" day"""""
"Today is a "good" day" | "Today is a good"" day"""
Today is a "good" day" | "Today is a ""good"" day"""
Today "is a "good" day" | "Today ""is a ""good"" day"""

@qiyanjun
Member

see #623 for a newer solution

@qiyanjun qiyanjun closed this Mar 13, 2022
@jxmorris12 jxmorris12 deleted the fix-extra-quotation-marks- branch June 8, 2022 21:04
Development

Successfully merging this pull request may close these issues.

When augmenting, extra quotation marks are added automatically

5 participants