custom dataset loader #324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…face data loading
…using hugging face data loaders
…using huggingface dataloaders
@ArshdeepSekhon is this a continued PR from #310? If yes, I will close that one. @ArshdeepSekhon there are some issues with the flair library dependency; we fixed that in #322.
I didn't know datasets offered an easy way to load datasets from file, so this CustomDataset class was cool to see! I made some minor comments about handling some edge cases.
The biggest comment that I have is that after reading the code for this class, I'm still not completely sure how to use it. I think part of it can be resolved with more detailed documentation. But also, I think part of it is that we're abstracting too much away from the users. We are coupling CustomDataset too tightly with Huggingface's datasets package. Part of me feels like it would be easier for users if we just require them to create the appropriate datasets.Dataset object and pass it in directly, instead of requiring them to provide the arguments that are eventually used to build the datasets.Dataset object. After all, Huggingface has better documentation and examples for guiding users in building the dataset. Then, we can simply skip most of the complicated logic in __init__ and just focus on packing the examples as OrderedDict for TextAttack.
I think we should aim for two goals: (1) a custom class for Huggingface's datasets.Dataset, and (2) a custom dataset class that is more flexible and designed to work with Python lists, dicts, and the PyTorch Dataset class without being coupled to Huggingface's datasets. Supporting (2) would allow users who are familiar with PyTorch data loading to skip the process of learning Huggingface's API. Also, for (2), I don't think we should worry about how to process CSV files, JSON files, etc. Opening and reading files should be done by the user who owns the files.
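To make goal (2) concrete, here is a minimal sketch of a dataset class that wraps plain Python lists as (OrderedDict, label) pairs with no HuggingFace dependency. The names SimpleDataset and input_columns are hypothetical, not the actual TextAttack API:

```python
from collections import OrderedDict

class SimpleDataset:
    """Hypothetical sketch: wrap plain Python lists as (OrderedDict, label)
    pairs, with no coupling to HuggingFace's datasets package."""

    def __init__(self, examples, labels, input_columns=("text",)):
        self.examples = []
        for ex, label in zip(examples, labels):
            if isinstance(ex, str):
                ex = (ex,)  # single-input tasks: promote to a 1-tuple
            self.examples.append((OrderedDict(zip(input_columns, ex)), label))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]
```

A user familiar with PyTorch could then build this from any source they like (CSV, JSON, a DataFrame) without learning a new loading API.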
class CustomDataset(TextAttackDataset):
    """Loads a Custom Dataset from a file/list of files and prepares it as a
Seems like users can pass in a list of files (or a dictionary of files where keys are the split names, e.g. "train", "test"). But the documentation for the name argument suggests that it's just a string.
- label_map: Mapping if output labels should be re-mapped. Useful
    if model was trained with a different label arrangement than
    provided in the ``datasets`` version of the dataset.
- output_scale_factor (float): Factor to divide ground-truth outputs by.
I'm confused about what output_scale_factor is for. Can you maybe give an example of when this would be used?
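One guess at its purpose, not confirmed by the PR: dividing ground-truth regression targets into a normalized range, e.g. STS-B similarity scores in [0, 5] scaled to [0, 1]. A sketch under that assumption:

```python
def scale_output(label, output_scale_factor=None):
    # Hypothetical illustration: divide a ground-truth regression target
    # by output_scale_factor, e.g. an STS-B score in [0, 5] with
    # output_scale_factor=5.0 yields a target in [0, 1].
    if output_scale_factor:
        return label / output_scale_factor
    return label
```

If that is the intended use, the docstring should say so explicitly.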
    infile_format=None,
    split=None,
    label_map=None,
    subset=None,
I don't see the subset argument being used anywhere in the code when loading the examples (other than in the logging part). I think subset is used in HuggingFaceDataset because datasets like glue have subsets (e.g. sst2), but I don't see the point here. Do you think it is redundant?
# if no split in custom data, default split is None
# if user hasn't specified a split to use, raise error if the dataset has splits
if split is None:
    if set(self._dataset.keys()) <= set(["train", "validation", "test"]):
Two cases pop into my mind:
- What happens if `set(self._dataset.keys())` is empty? Then, the condition is true and will raise the `ValueError`. But what if the CSV/JSON file we loaded is for the train split, and thus `self._dataset` doesn't really have any concept of "splits"?
- What happens if `set(self._dataset.keys())` has split names that aren't exactly `train`, `validation`, and `test`? The user could have passed in `train`, `val`, `test`, which would cause the condition to be `False`.
One solution could be that if there are splits (i.e. `set(self._dataset.keys())` is non-empty), we require `split` to not be `None`.
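A sketch of that suggested rule, where available_splits stands in for `set(self._dataset.keys())` (the helper name is illustrative):

```python
def resolve_split(available_splits, split=None):
    """Hypothetical sketch: if the loaded dataset exposes named splits,
    require the caller to pick one explicitly."""
    splits = set(available_splits)
    if not splits:
        # e.g. a single CSV file with no notion of splits
        return None
    if split is None:
        raise ValueError(
            f"Dataset has splits {sorted(splits)}; please pass split= explicitly."
        )
    if split not in splits:
        raise ValueError(f"Unknown split {split!r}; available: {sorted(splits)}")
    return split
```

This handles both edge cases above: a split-less file passes through, and non-standard split names like `val` still force an explicit choice.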
def __init__(
    self,
    name,
    infile_format=None,
I think file_format is an easier term to remember.
dataset_columns = []
dataset_columns.append(self._dataset.column_names)

if not set(dataset_columns[0]) <= set(self._dataset.column_names):
Maybe I'm just not familiar with how datasets work, but does this mean the first column of the data should always be the input and the second column the output?
My main concern is that this would be too restrictive for users. For example, NLI datasets have two input columns "premise" and "hypothesis".
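One less restrictive alternative, sketched here with hypothetical names, is to let the caller name the input and output columns explicitly, which covers multi-input tasks like NLI:

```python
from collections import OrderedDict

def format_example(row, input_columns, output_column):
    # Hypothetical sketch: the caller names the columns explicitly
    # instead of the code assuming "first column is input, second is
    # output", so NLI-style multi-input rows work naturally.
    inputs = OrderedDict((col, row[col]) for col in input_columns)
    return inputs, row[output_column]
```

For example, an NLI row would use `input_columns=("premise", "hypothesis")` and `output_column="label"`.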
- name(Union[str, dict, pd.DataFrame]): the user specified dataset file names, dicts or pandas dataframe
- infile_format(str): Specifies type of file for loading HuggingFaceDataset : csv, json, pandas, text
    from local_files will be loaded as ``datasets.load_dataset(filetype, data_files=name)``.
- label_map: Mapping if output labels should be re-mapped. Useful
Could you include an example of how a user should define this label_map (also, is this a dict)? For example, should it look like {"Positive": 1, "Negative": 0}?
if shuffle:
    random.shuffle(self.examples)

def _format_raw_example(self, raw_example):
Since the last three methods are just copied from HuggingFaceDataset, would it be more appropriate to just inherit from HuggingFaceDataset?
if dataset_columns is None:
    # automatically infer from dataset
    dataset_columns = []
    dataset_columns.append(self._dataset.column_names)
It seems that self._dataset.column_names is a list. Why are we appending it to another list instead of setting dataset_columns = self._dataset.column_names?
| """Loads a Custom Dataset from a file/list of files and prepares it as a | ||
| TextAttack dataset. | ||
|
|
||
| - name(Union[str, dict, pd.DataFrame]): the user specified dataset file names, dicts or pandas dataframe |
Don't you think name argument would be too confusing? Maybe file_name_or_data?
@ArshdeepSekhon It is a good idea to sync your copy of the code with the master repository regularly. This way you can quickly account for changes: $ git remote add upstream https://github.com/QData/TextAttack
@ArshdeepSekhon please run
@ArshdeepSekhon please fix simple formatting issues at least.
@ArshdeepSekhon let me fix the build doc
@ArshdeepSekhon @jinyongyoo I thought datasets/dataset.py should be an abstract class with some required abstract methods.
@ArshdeepSekhon @jinyongyoo I recommend we triage this PR for now. Agreed?
Loads a user-specified dataset from a local file as a TextAttack dataset, using HuggingFace data loading utilities.