
Conversation

@ArshdeepSekhon
Collaborator

Loads a user-specified dataset from a local file as a TextAttack dataset using Hugging Face data loader utils:

  1. Adds a CustomDataset class to the textattack datasets module
  2. Converts a user-specified dataset into a TextAttack dataset using Hugging Face data loader utils
  3. Allows the user to specify input and output columns from a local file / pandas DataFrame / dict
  4. Accordingly changes the augment command line from the CSV reader to CustomDataset

@qiyanjun
Member

qiyanjun commented Nov 4, 2020

@ArshdeepSekhon is this a continued PR from 310? If yes, I will close that one.

@ArshdeepSekhon there are some issues with the flair library dependency, so we fixed that in #322

@ArshdeepSekhon
Collaborator Author

@qiyanjun This is a different PR; #310 adds an additional command to let the user specify evaluation on an entire dataset instead of sampling a subset.

Collaborator

@jinyongyoo jinyongyoo left a comment


I didn't know datasets offered an easy way to load datasets from file, so this CustomDataset class was cool to see! I made some minor comments about handling some edge cases.

The biggest comment I have is that after reading the code for this class, I'm still not completely sure how to use it. Part of that can be resolved with more detailed documentation, but I also think part of it is that we're abstracting too much away from the users: we are coupling CustomDataset too tightly with Huggingface's datasets package. Part of me feels it would be easier for users if we just required them to create the appropriate datasets.Dataset object and pass it in directly, instead of providing the arguments that are eventually used to build the datasets.Dataset object. After all, Huggingface has better documentation and examples for guiding users on how to build the dataset. Then we can skip most of the complicated logic in __init__ and just focus on packing the examples as OrderedDicts for TextAttack.

I think we should aim for two goals: (1) a custom class for Huggingface's datasets.Dataset, and (2) a more flexible custom dataset class designed to work with Python lists, dicts, and the PyTorch Dataset class without being coupled to Huggingface's datasets. Supporting (2) would allow users who are familiar with PyTorch data loading to skip the process of learning Huggingface's API. Also, for (2), I don't think we should worry about how to process CSV files, JSON files, etc. Opening and reading files should be done by the user who owns the files.
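To illustrate goal (2), here is a minimal sketch of a dataset class that wraps examples the user has already loaded into plain Python dicts. All names here are illustrative, not TextAttack's actual API:

```python
from collections import OrderedDict

class SimpleDataset:
    """Hypothetical sketch of goal (2): wrap already-loaded examples
    (a list of dicts) without touching Huggingface's datasets package."""

    def __init__(self, examples, input_columns=("text",), output_column="label"):
        # The user has already read `examples` from disk themselves.
        self.examples = [
            (
                OrderedDict((col, ex[col]) for col in input_columns),
                ex[output_column],
            )
            for ex in examples
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]

data = [
    {"text": "great movie", "label": 1},
    {"text": "bad movie", "label": 0},
]
ds = SimpleDataset(data)
```

The class only packs examples as (OrderedDict of inputs, output) pairs; reading CSV/JSON files stays the user's responsibility.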



class CustomDataset(TextAttackDataset):
"""Loads a Custom Dataset from a file/list of files and prepares it as a
Collaborator

Seems like users can pass in a list of files (or a dictionary of files where the keys are split names, e.g. "train", "test"). But the documentation for the name argument suggests that it's just a string.

- label_map: Mapping if output labels should be re-mapped. Useful
if model was trained with a different label arrangement than
provided in the ``datasets`` version of the dataset.
- output_scale_factor (float): Factor to divide ground-truth outputs by.
Collaborator

I'm confused what output_scale_factor is for. Can you maybe give an example of when this would be used?
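For reference, the docstring's "divide ground-truth outputs by" wording suggests behavior along these lines, e.g. rescaling regression targets such as STS-B's 0–5 similarity scores into [0, 1]. This is a hedged sketch; the variable names are illustrative, not the PR's code:

```python
# Plausible use of output_scale_factor, inferred from its docstring
# ("Factor to divide ground-truth outputs by").
raw_scores = [0.0, 2.5, 5.0]       # e.g. STS-B similarity scores in [0, 5]
output_scale_factor = 5.0          # divide so the model sees targets in [0, 1]
scaled = [s / output_scale_factor for s in raw_scores]
```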

infile_format=None,
split=None,
label_map=None,
subset=None,
Collaborator

I don't see subset argument being used anywhere in the code when loading the examples (other than in logging part). I think subset is used in HuggingFaceDataset because datasets like glue has subsets (e.g. sst2), but I don't see the point here. Do you think it is redundant?

# if no split in custom data, default split is None
# if user hasn't specified a split to use, raise error if the dataset has splits
if split is None:
if set(self._dataset.keys()) <= set(["train", "validation", "test"]):
Collaborator

Two cases pop into my mind:

  1. What happens if set(self._dataset.keys()) is empty? Then the condition is true and will raise the ValueError. But what if the CSV/JSON file we loaded is the train split, and thus self._dataset doesn't really have any concept of "splits"?
  2. What happens if set(self._dataset.keys()) has split names that aren't exactly train, validation, and test? The user could have passed in train, val, test, which would cause the condition to be False.

Collaborator

One solution could be that if there are splits (e.g. set(self._dataset.keys()) is non-empty), we require split to not be None.
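That suggestion could be sketched as a small helper that requires an explicit split whenever the loaded dataset actually exposes named splits. The function name and messages are hypothetical, not part of the PR:

```python
def resolve_split(dataset_splits, split=None):
    """Hypothetical helper: if the dataset has named splits, the user
    must pick one; if it's a flat dataset, there is no split concept."""
    if dataset_splits:
        if split is None:
            raise ValueError(
                f"Dataset has splits {sorted(dataset_splits)}; "
                "please specify one via split=."
            )
        if split not in dataset_splits:
            raise ValueError(f"Unknown split {split!r}.")
        return split
    return None  # flat dataset loaded from a single file

chosen = resolve_split({"train", "val", "test"}, split="val")
```

This handles both edge cases above: an empty split set no longer raises, and non-standard split names like "val" are accepted as long as the user names them.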

def __init__(
self,
name,
infile_format=None,
Collaborator

I think file_format is an easier term to remember.

dataset_columns = []
dataset_columns.append(self._dataset.column_names)

if not set(dataset_columns[0]) <= set(self._dataset.column_names):
Collaborator

Maybe I'm just not familiar with how datasets work, but does this mean the first column of the data should always be the input and the second column the output?

My main concern is that this would be too restrictive for users. For example, NLI datasets have two input columns, "premise" and "hypothesis".
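One way to avoid that restriction is to let users name the input and output columns explicitly instead of relying on column order, so multi-input datasets like NLI still work. A minimal sketch with an illustrative helper (not the PR's code):

```python
def split_columns(example, input_columns, output_column):
    """Pick named input columns (in order) and the output column
    from one example dict."""
    inputs = tuple(example[c] for c in input_columns)
    return inputs, example[output_column]

nli_example = {
    "premise": "A man is eating.",
    "hypothesis": "Someone is eating.",
    "label": 0,
}
inputs, label = split_columns(nli_example, ("premise", "hypothesis"), "label")
```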

- name(Union[str, dict, pd.DataFrame]): the user specified dataset file names, dicts or pandas dataframe
- infile_format(str): Specifies type of file for loading HuggingFaceDataset : csv, json, pandas, text
from local_files will be loaded as ``datasets.load_dataset(filetype, data_files=name)``.
- label_map: Mapping if output labels should be re-mapped. Useful
Collaborator

Could you include an example of how a user should define this label_map (also, is this a dict)? For example, should it look like {"Positive": 1, "Negative": 0}?
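Assuming label_map is a plain dict of that shape, applying it while packing examples might look like this. This is a sketch of the presumed semantics, not the PR's actual code:

```python
# Hypothetical label_map usage: remap string (or re-arranged integer)
# labels to the integer ids the model was trained with.
label_map = {"Positive": 1, "Negative": 0}
raw_labels = ["Negative", "Positive", "Positive"]
mapped = [label_map[label] for label in raw_labels]
```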

if shuffle:
random.shuffle(self.examples)

def _format_raw_example(self, raw_example):
Collaborator

Since the last three methods are just copied from HuggingFaceDataset, would it be more appropriate to just inherit from HuggingFaceDataset?

if dataset_columns is None:
# automatically infer from dataset
dataset_columns = []
dataset_columns.append(self._dataset.column_names)
Collaborator

It seems that self._dataset.column_names is a list. Why are we appending it to another list instead of setting dataset_columns = self._dataset.column_names?

"""Loads a Custom Dataset from a file/list of files and prepares it as a
TextAttack dataset.

- name(Union[str, dict, pd.DataFrame]): the user specified dataset file names, dicts or pandas dataframe
Collaborator

Don't you think the name argument would be too confusing? Maybe file_name_or_data?

@qiyanjun
Member

qiyanjun commented Nov 7, 2020

@ArshdeepSekhon It is a good idea to sync your copy of the code with the master repository regularly. This way you can quickly account for changes:

$ git remote add upstream https://github.com/QData/TextAttack
$ git fetch upstream
$ git rebase upstream/master

@qiyanjun
Member

qiyanjun commented Nov 25, 2020

@ArshdeepSekhon please run

  1. "black textattack" to correct all formatting issues first,
  2. then "isort textattack" to correct import-ordering issues,
  3. then "flake8 textattack" to correct remaining style issues,
  4. then "make test"

@qiyanjun
Member

qiyanjun commented Dec 4, 2020

@ArshdeepSekhon please fix simple formatting issues at least.

@qiyanjun
Member

qiyanjun commented Dec 5, 2020

@ArshdeepSekhon let me fix the build doc

@qiyanjun
Member

qiyanjun commented Dec 5, 2020

@ArshdeepSekhon @jinyongyoo I thought datasets/dataset.py should be an abstract class with some required abstract methods, e.g.:

import abc

class AbstractClass(metaclass=abc.ABCMeta):

    @abc.abstractmethod
    def abstractMethod(self):
        return
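Applied to datasets/dataset.py, such an abstract base might look like the following sketch. The method names are illustrative, not TextAttack's actual API:

```python
import abc
from collections import OrderedDict

class Dataset(abc.ABC):
    """Hypothetical abstract base: every dataset must be indexable and
    sized, returning (OrderedDict of inputs, output) pairs."""

    @abc.abstractmethod
    def __getitem__(self, i):
        """Return the i-th (inputs, output) example."""

    @abc.abstractmethod
    def __len__(self):
        """Return the number of examples."""

class ListDataset(Dataset):
    """Concrete subclass backed by an in-memory list of pairs."""

    def __init__(self, pairs):
        self._pairs = pairs

    def __getitem__(self, i):
        return self._pairs[i]

    def __len__(self):
        return len(self._pairs)

ds = ListDataset([(OrderedDict(text="hello"), 1)])
```

Instantiating Dataset directly raises a TypeError, which is how the abstract base enforces that every concrete dataset implements both methods.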

@qiyanjun
Member

qiyanjun commented Dec 5, 2020

@ArshdeepSekhon @jinyongyoo I recommend we table this PR for now. Agreed?

@qiyanjun qiyanjun closed this Dec 5, 2020
@qiyanjun qiyanjun reopened this Dec 5, 2020
@qiyanjun qiyanjun closed this Dec 7, 2020
